It's been quite awhile since I debugged a computer program. Too long. Although I miss coding, the thing I miss more is the process of finding and fixing bugs in the code. Especially the really hard-to-track-down bugs that have you tearing your hair out - convinced your code cannot possibly be wrong - that something else must be the problem. But then when you track down that impossible bug, it becomes so obvious.
I wanted to write here about the most fun I've ever had debugging code. And also the most bizarre, since fixing the bugs required the use of an oven. Yes, an oven. It turned out the bugs were temperature dependent.
But first some background. The year is 1986. I'm the co-founder of a university spin-out company in Hull, England, called Metaforth Ltd. The company was set up to commercialise a stack-based computer architecture that runs the language Forth natively. In other words Forth is the equivalent of the CPU's assembly language. Our first product was a 16-bit industrial processor which we called the MF1600. It was a 2-card module, designed to plug into the (then) industry standard VME bus. One of the cards was the Central Processing Unit (CPU) - not using a microprocessor, but a set of discrete components using fast Transistor Transistor Logic devices. The other card provided memory, input-output interfaces, and the logic needed to interface with the VME bus.
The MF1600 was fast. It ran Forth at 6.6 Million Forth Instructions Per Second (MIPS). Sluggish of course by today's standards, but in 1986 6.6 MIPS was faster than any microprocessor. Then PCs were powered by the state-of-the-art Intel 286 with a clock frequency of 6MHz, managing around 0.9 Assembler MIPS. And because Forth instructions are higher level than assembler, the speed differential was greater still when doing real work.
Ok, now to the epic debugging...
One of our customers reported that during extended tests in an industrial rack the MF1600 was mysteriously crashing. And crashing in a way we'd not experienced before when running tried and tested code. One of their engineers noted that their test rack was running very hot, almost certainly exceeding the MF1600's upper temperature limit of 55°C. Out of spec maybe, but still not good.
So we knew the problem was temperature related. Now any experienced electronics engineer will know that electrical signals take time to get from one place to another. It's called propagation delay, and these delays are normally measured in billionths of a second (nanoseconds). And propagation delays tend to increase with temperature. Like any CPU our MF1600 relies on signals getting to the right place at the right time. And if several signals have to reach the same place at the same time then even a small extra delay in one of them can cause major problems.
On most CPUs when each basic instruction is executed, a tiny program inside the CPU actually does the work of that instruction. Those tiny programs are called microcode. Here is a blog post from several years ago where I explain what microcode is. Microcode is magic stuff - it's the place where software and hardware meet. Just like any program microcode has to be written and debugged, but uniquely - when you write microcode - you have to take account of how long it takes to process and route signals and data across the CPU: 100nS from A to B; 120nS from C and D, and so on. So if the timing in any microcode is tight (i.e. only just allows for the normal delay and leaves no margin of error), it could result in that microcode program crashing at elevated temperatures.
So, we reckoned we had one, or possibly several, microcode programs in the MF1600 CPU with 'tight' timing. The question was, how to find them.
The MF1600 CPU had around 86 (Forth) instructions, and the timing bugs could be in any of them. Now testing microcode is very difficult, and the nature of the problem made the testing problem even worse. A timing problem at elevated temperatures means that testing the microcode by single-stepping the CPU clock and tracing the signals through the CPU with a logic analyser wouldn't help at all. We needed a way to efficiently identify the buggy instructions. Then we could worry about debugging them later. What we wanted was a way to test (i.e. exercise single instructions, one by one), on a running system at high temperatures.
Then we remembered that we don't need all 86 instructions to run the computer. Most of them can be emulated by putting together a set of simpler instructions. So a strategy formed: (1) write a set of tiny Forth programs that replace as many of the CPU instructions as possible, (2) recompile the operating system, then (3) hope that the CPU runs ok at high temperature. If it does then (4) run the CPU in an oven and one by one test the replaced instructions.
Actually it didn't take long to do steps (1) and (2), because the Forth programs already existed to express more complex instructions as sets of simpler ones. Many Forth systems on conventional microprocessor systems were built like that. In the end we had a minimal set of about 24 instructions. So, with the operating system recompiled and installed we put the CPU into the oven and switched on the heat. The system ran perfectly (but a little slower than usual), and continued to run well above the temperature it had previously crashed. A real stroke of luck.
Here's an example of a simple Forth instruction to replace two values on the stack with the smaller of those values, expressed as a Forth program we call MIN
: MIN OVER OVER > IF SWAP THEN DROP ;
(From my 1983 book The Complete Forth).
From then on it was relatively easy to run small test programs to exercise the other 62 instructions (which were of course still there in the CPU - just not used by the operating system). A couple of days work and we found the rogue 2 instructions that were crashing at temperature. They were - as you might have expected - rather complex instructions. One was (LOOP) an instruction for do loops.
Then debugging those instructions simply required studying the microcode and the big chart with all the CPU delay times, over several pots of coffee. Knowing (or strongly suspecting) that what we were looking for were timing problems, called race hazards, where the data from one part of the CPU just doesn't have time to get to another part in time to be used for the next step of the microcode program. Having identified the suspect timing I then re-wrote the microcode for those instructions to leave a bit more time - by adding one clock cycle to each instruction (50nS).
Then reverting to the old non-patched operating system, it was the moment of truth. Back in the oven, cranking up the temperature, while the CPU was running test programs specifically designed to stress those particular instructions. Yes! The system didn't crash at all, over several days of running at temperature. I recall pushing the temperature above 100°C. Components on the CPU circuit board were melting, but still it didn't crash.
So that's how we debugged code with an oven.