Alan Winfield's Web Log: Extreme debugging - a tale of microcode and an oven

Thursday, March 07, 2013

Extreme debugging - a tale of microcode and an oven

It's been quite a while since I debugged a computer program. Too long. Although I miss coding, the thing I miss more is the process of finding and fixing bugs in the code. Especially the really hard-to-track-down bugs that have you tearing your hair out - convinced your code cannot possibly be wrong - that something else must be the problem. But then when you track down that impossible bug, it becomes so obvious.

I wanted to write here about the most fun I've ever had debugging code. And also the most bizarre, since fixing the bugs required the use of an oven. Yes, an oven. It turned out the bugs were temperature dependent.

But first some background. The year is 1986. I'm the co-founder of a university spin-out company in Hull, England, called Metaforth Ltd. The company was set up to commercialise a stack-based computer architecture that runs the language Forth natively. In other words, Forth is the equivalent of the CPU's assembly language. Our first product was a 16-bit industrial processor which we called the MF1600. It was a 2-card module, designed to plug into the (then) industry standard VME bus. One of the cards was the Central Processing Unit (CPU) - not a microprocessor, but a set of discrete components built from fast Transistor-Transistor Logic (TTL) devices. The other card provided memory, input-output interfaces, and the logic needed to interface with the VME bus.

The MF1600 was fast. It ran Forth at 6.6 Million Forth Instructions Per Second (MIPS). Sluggish of course by today's standards, but in 1986 6.6 MIPS was faster than any microprocessor. At the time, PCs were powered by the state-of-the-art Intel 286 with a clock frequency of 6 MHz, managing around 0.9 assembler MIPS. And because Forth instructions are higher level than assembler, the speed differential was greater still when doing real work.

Ok, now to the epic debugging...

One of our customers reported that during extended tests in an industrial rack the MF1600 was mysteriously crashing. And crashing in a way we'd not experienced before when running tried and tested code. One of their engineers noted that their test rack was running very hot, almost certainly exceeding the MF1600's upper temperature limit of 55°C. Out of spec maybe, but still not good.

So we knew the problem was temperature related. Now any experienced electronics engineer will know that electrical signals take time to get from one place to another. It's called propagation delay, and these delays are normally measured in billionths of a second (nanoseconds). And propagation delays tend to increase with temperature. Like any CPU our MF1600 relies on signals getting to the right place at the right time. And if several signals have to reach the same place at the same time then even a small extra delay in one of them can cause major problems.

On most CPUs, when each basic instruction is executed, a tiny program inside the CPU actually does the work of that instruction. Those tiny programs are called microcode. Here is a blog post from several years ago where I explain what microcode is. Microcode is magic stuff - it's the place where software and hardware meet. Just like any program, microcode has to be written and debugged, but uniquely - when you write microcode - you have to take account of how long it takes to process and route signals and data across the CPU: 100 ns from A to B; 120 ns from C to D, and so on. So if the timing in any microcode is tight (i.e. only just allows for the normal delay and leaves no margin of error), it could result in that microcode program crashing at elevated temperatures.
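To make "tight" timing concrete, here is a minimal Python sketch of a slack calculation. The 50 ns clock cycle is the figure quoted later in this post; the path names, nominal delays, cycle allowances and the 0.2% per °C derating factor are all invented for illustration and are not real MF1600 numbers.

```python
CLOCK_NS = 50.0        # one clock cycle, as quoted later in the post
DERATE_PER_C = 0.002   # ASSUMPTION: delay grows 0.2% per degree C above 25 C

def delay_at(delay_25c_ns, temp_c):
    """Scale a nominal (25 C) propagation delay to another temperature."""
    return delay_25c_ns * (1.0 + DERATE_PER_C * (temp_c - 25.0))

def slack_ns(delay_25c_ns, cycles_allowed, temp_c):
    """Time left after the signal arrives; negative means the data is late."""
    return cycles_allowed * CLOCK_NS - delay_at(delay_25c_ns, temp_c)

# ASSUMPTION: hypothetical paths as (name, nominal delay in ns, cycles allowed)
paths = [
    ("A to B", 100.0, 3),
    ("C to D", 120.0, 3),
    ("tight path", 147.0, 3),   # only 3 ns of margin at 25 C
]

def failing_paths(temp_c):
    """Names of the paths whose signals arrive too late at this temperature."""
    return [name for name, delay, cycles in paths
            if slack_ns(delay, cycles, temp_c) < 0]
```

In this toy model every path meets its deadline at room temperature, and only the path with almost no margin fails once the temperature rises. Allowing it one more clock cycle makes the slack comfortably positive again.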

So, we reckoned we had one, or possibly several, microcode programs in the MF1600 CPU with 'tight' timing. The question was, how to find them.

The MF1600 CPU had around 86 (Forth) instructions, and the timing bugs could be in any of them. Now testing microcode is very difficult, and the nature of the problem made testing even harder. A timing problem at elevated temperatures means that testing the microcode by single-stepping the CPU clock and tracing the signals through the CPU with a logic analyser wouldn't help at all. We needed a way to efficiently identify the buggy instructions. Then we could worry about debugging them later. What we wanted was a way to test single instructions, one by one, on a running system at high temperatures.

Then we remembered that we don't need all 86 instructions to run the computer. Most of them can be emulated by putting together a set of simpler instructions. So a strategy formed: (1) write a set of tiny Forth programs that replace as many of the CPU instructions as possible, (2) recompile the operating system, then (3) hope that the CPU runs ok at high temperature. If it does then (4) run the CPU in an oven and one by one test the replaced instructions.

Actually it didn't take long to do steps (1) and (2), because the Forth programs already existed to express more complex instructions as sets of simpler ones. Many Forth systems on conventional microprocessor systems were built like that. In the end we had a minimal set of about 24 instructions. So, with the operating system recompiled and installed, we put the CPU into the oven and switched on the heat. The system ran perfectly (but a little slower than usual), and continued to run well above the temperature at which it had previously crashed. A real stroke of luck.

Here's an example of a simple Forth instruction, which replaces two values on the stack with the smaller of those values, expressed as a Forth program we call MIN:
: MIN OVER OVER > IF SWAP THEN DROP ;
(From my 1983 book The Complete Forth).
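For readers who don't know Forth, here is a rough Python model of how that definition of MIN works on a stack, using only simpler operations. It is a sketch, not how the MF1600 did it: the Python list stands in for the data stack, the function names are my own, and the value used for a true flag (-1) is an arbitrary choice, since IF only cares whether the flag is non-zero.

```python
def over(stack):
    """OVER: copy the second item to the top of the stack."""
    stack.append(stack[-2])

def swap(stack):
    """SWAP: exchange the top two items."""
    stack[-1], stack[-2] = stack[-2], stack[-1]

def drop(stack):
    """DROP: discard the top item."""
    stack.pop()

def greater(stack):
    """>: pop b then a, push a flag that is non-zero when a > b."""
    b = stack.pop()
    a = stack.pop()
    stack.append(-1 if a > b else 0)   # ASSUMPTION: -1 chosen as the true flag

def forth_min(stack):
    """: MIN OVER OVER > IF SWAP THEN DROP ;"""
    over(stack)
    over(stack)
    greater(stack)
    if stack.pop() != 0:   # IF consumes the flag
        swap(stack)        # put the larger value on top...
    drop(stack)            # ...so that DROP leaves the smaller one
```

Starting from a stack holding 7 then 3, the two OVERs duplicate the pair, > leaves a true flag, SWAP puts 7 on top, and DROP removes it, leaving 3.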

From then on it was relatively easy to run small test programs to exercise the other 62 instructions (which were of course still there in the CPU - just not used by the operating system). A couple of days' work and we found the 2 rogue instructions that were crashing at temperature. They were - as you might have expected - rather complex instructions. One was (LOOP), an instruction for DO loops.
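The isolation strategy can be sketched in a few lines of Python. Everything here is a stand-in: the instruction names are placeholders, the two "hidden faults" are picked arbitrarily, and `simulated_oven_test` replaces what was in reality a test program running on hot hardware. The point is only the structure: because the operating system depends on nothing outside the minimal set, a crash implicates the single instruction under test.

```python
def find_rogue_instructions(all_instructions, minimal_set, crashes_when_hot):
    """Exercise each instruction outside the minimal set on its own and
    return those whose test program crashes at temperature."""
    candidates = [i for i in all_instructions if i not in minimal_set]
    return [i for i in candidates if crashes_when_hot(i)]

# ASSUMPTION: placeholder names for the roughly 86 instructions
all_instructions = [f"INSTR{n:02d}" for n in range(86)]

# ASSUMPTION: the first 24 stand in for the minimal set the OS was rebuilt on
minimal_set = set(all_instructions[:24])

# ASSUMPTION: two arbitrarily chosen instructions play the part of the faults
hidden_faults = {"INSTR40", "INSTR71"}

def simulated_oven_test(instruction):
    """Stand-in for running a stress program for one instruction in the oven."""
    return instruction in hidden_faults

rogues = find_rogue_instructions(all_instructions, minimal_set, simulated_oven_test)
```

With 86 instructions and a minimal set of 24, that leaves 62 candidates to exercise one at a time.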

Debugging those instructions then simply required studying the microcode and the big chart with all the CPU delay times, over several pots of coffee. We knew (or strongly suspected) that we were looking for timing problems called race hazards, where the data from one part of the CPU doesn't have time to reach another part before it is needed for the next step of the microcode program. Having identified the suspect timing I then re-wrote the microcode for those instructions to leave a bit more time - by adding one clock cycle (50 ns) to each instruction.
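As a rough sanity check on what one extra cycle costs, the figures quoted in this post can be put side by side. This assumes that 50 ns is the full clock period and that 6.6 MIPS is an average across instructions. Both are my reading of the numbers, not documented MF1600 specifications.

```python
CLOCK_PERIOD_NS = 50.0   # one clock cycle, from the post
FORTH_MIPS = 6.6         # quoted speed, treated here as an average (ASSUMPTION)

def clock_mhz(period_ns):
    """Clock frequency in MHz implied by a clock period in nanoseconds."""
    return 1000.0 / period_ns

def average_instruction_ns(mips):
    """Average time per instruction in nanoseconds at a given MIPS rate."""
    return 1000.0 / mips

def average_cycles_per_instruction(mips, period_ns):
    """Average number of clock cycles one instruction takes."""
    return average_instruction_ns(mips) / period_ns

cycles = average_cycles_per_instruction(FORTH_MIPS, CLOCK_PERIOD_NS)
```

Under those assumptions the clock runs at 20 MHz and an average instruction takes about 150 ns, or roughly three cycles, so one extra cycle on two complex instructions is a small price for reliable timing.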

Then reverting to the old non-patched operating system, it was the moment of truth. Back in the oven, cranking up the temperature, while the CPU was running test programs specifically designed to stress those particular instructions. Yes! The system didn't crash at all, over several days of running at temperature. I recall pushing the temperature above 100°C. Components on the CPU circuit board were melting, but still it didn't crash.

So that's how we debugged code with an oven.

25 comments:

pjt March 09, 2013 7:55 am
Great story. Thanks for this.

I remember *Complete Forth*, and I loved the idea of the language, but the Forth implementation I got for ZX Spectrum was really quite painful to use because everything concentrated on the language itself, not the operability. It wasn't very usable, or then the instructions weren't good enough for me to pick up.

Unknown March 09, 2013 2:05 pm
Thank you for such a great article. Fascinating. I have so many good memories of FORTH, I was spoiled and never really worked well with more elaborated obj. oriented languages. Amazing, so glad I found your blog.

Stefan Holm March 09, 2013 4:43 pm
Thank you for a really excellent article.

I'm sure I would have loved Forth, as "thinking in stacks" came pretty natural having used and programmed HP RPN calculators, but the lack of a good IDE on my system at the time made it a pain, so it was too brief an encounter.
Your description of microcode as the magic that connects hardware and software is spot on. Learning about microcode and writing a minimal set of instructions was an eye opener for me, and necessary to really understand how a computer works.
Thank you again!

Anonymous March 09, 2013 10:23 pm
Hi. Old FORTHy here. FORTH has been my debug embedded tool of choice since about that same time, still have it embedded in my current C code. Very sharp tool - double-edged sword with no handle - concomitant of power. Though I'm moving to Lua now since I need other people working with me.

I have a practical question for your then-self:

Why did you spend all that effort on building your own chip when there was the Harris chip available at that time? Maybe not quite as fast in cycle clock, but encoded up to 3 words per instruction. We had it running a real time rendered valley scene with sunrise through sunset shading changes. With the video bit-banged in code. As its wallpaper.

Anonymous March 10, 2013 3:03 am
Great story. If someone made a TV series about engineers solving problems like this, I'd watch it. Much more interesting than a cop or lawyer show.

♪ March 10, 2013 8:50 pm
I dealt with a similar problem a couple of weeks back, except that the chip refused to respond unless heated to 100°C or so. Turns out that during the soldering process (we were using that board to learn how to solder BGA parts) the part had been overheated and warped ever so slightly, and only took on the right shape when hot.

We decided to use leaded solder balls after that.

CZMorris March 11, 2013 2:15 pm
Great story. It is amazing the methods we use for debugging when we really want to solve a problem. Spark (EMI) generators, heating and cooling. I love doing test setups to recreate that "one in a million" bug that just so happens to have occurred at a very important customer's site.

Dan Sutton March 11, 2013 3:48 pm
Wow - I remember you guys -- a few of us were seriously into FORTH, what with the Jupiter Ace and so on: when we heard about a chip running it natively, we were excited beyond imagination - it's so nice to hear a story like this from back in the days...

Unknown March 11, 2013 3:57 pm
I actually built a signal processing system using AMD bit-slice processors, and had to write quite a bit of microcode. Great story! (This to get my PhD at CMU's Robotics Institute, by the way.)

I have one for you. Recently, I was using at home an "ethernet over AC wiring" system. It allowed me to have a network over several rooms without relying on WiFi to do communication. My wife and I remodeled a small area, and ended up adding a dimmer switch for new lights. That area has some electronic equipment.

The electronic equipment would work fine during the day, then just stop altogether at night. Finally, I realized that, to look at what was going on at night, I had to turn on the lights. Lo and behold, the dimmer was interfering with the AC ethernet system! I finally was able to take a laptop, hook it up to an Internet speed test site, and move the speed up or down using the dimmer!

The solution was to replace all the dimming circuitry with DC instead of AC switches and lights.

I love debugging. All the best. Rafael.

Anonymous March 11, 2013 3:59 pm
Forth reminds me of my youth, when I did programming and learned a lot of things that I later didn't need anymore. It was a happy time being able to count machine cycles...

Resuna March 11, 2013 4:31 pm
Did making (loop) and other complex branch constructs single instructions really save that much?

gullygunyah March 11, 2013 10:23 pm
I'm not at all familiar with Forth or what it means for it to be stack based. (I thought the stack was important in most languages.)
But I do love a good debugging story and this was an interesting scenario described. Thanks.

Anonymous March 12, 2013 2:28 am
..and why was the test rack running so hot? :) Just kidding.. great story. Whenever you think you've got it rough, just imagine Charles Babbage doing arithmetic on a mechanical computer the size of your bathroom.

Anonymous March 12, 2013 10:07 am
One time I was trying to get an optical fiber network to work right. Every so often, around 4 pm, it wouldn't. Took a while, varied from box to box but not the time of day. Finally realized it was the red dust covers over the second relay connector set. It was a nice early spring in Redondo Beach and we tended to have the lab door open for the evening breeze, and the sun was shining through them and putting a DC offset on the pipes. Switched to all black covers and the problem went away.

Lots of stories. Lots.

Unknown March 14, 2013 1:11 pm
Great history

Tony Mach March 30, 2013 8:21 am
Thank you for sharing this great story! I think it is through such debugging sessions, trying to find such non-trivial bugs, that one gets a much "deeper" understanding of the entire system.

CC_Logic April 04, 2013 11:44 pm
Great article Alan! Inspired me to blog about another odd debugging moment: [http://cc-logic.com/blog/2013/4/4/extreme-debugging-with-a-rental-car]

Ian November 02, 2014 8:20 pm
(Ah, are comments only visible after being approved? If so, the last three from 2013 shouldn't have been - they're comment link spam.)