Showing posts with label electronics. Show all posts

Thursday, March 07, 2013

Extreme debugging - a tale of microcode and an oven

It's been quite a while since I debugged a computer program. Too long. Although I miss coding, the thing I miss more is the process of finding and fixing bugs in the code. Especially the really hard-to-track-down bugs that have you tearing your hair out - convinced your code cannot possibly be wrong - that something else must be the problem. But then when you track down that impossible bug, it becomes so obvious.

I wanted to write here about the most fun I've ever had debugging code. And also the most bizarre, since fixing the bugs required the use of an oven. Yes, an oven. It turned out the bugs were temperature dependent.

But first some background. The year is 1986. I'm the co-founder of a university spin-out company in Hull, England, called Metaforth Ltd. The company was set up to commercialise a stack-based computer architecture that runs the language Forth natively. In other words Forth is the equivalent of the CPU's assembly language. Our first product was a 16-bit industrial processor which we called the MF1600. It was a 2-card module, designed to plug into the (then) industry standard VME bus. One of the cards was the Central Processing Unit (CPU) - not using a microprocessor, but a set of discrete components using fast Transistor Transistor Logic devices. The other card provided memory, input-output interfaces, and the logic needed to interface with the VME bus.

The MF1600 was fast. It ran Forth at 6.6 Million Forth Instructions Per Second (MIPS). Sluggish, of course, by today's standards, but in 1986 6.6 MIPS was faster than any microprocessor. At the time PCs were powered by the state-of-the-art Intel 286 with a clock frequency of 6MHz, managing around 0.9 Assembler MIPS. And because Forth instructions are higher level than assembler, the speed differential was greater still when doing real work.

Ok, now to the epic debugging...

One of our customers reported that during extended tests in an industrial rack the MF1600 was mysteriously crashing. And crashing in a way we'd not experienced before when running tried and tested code. One of their engineers noted that their test rack was running very hot, almost certainly exceeding the MF1600's upper temperature limit of 55°C. Out of spec maybe, but still not good.

So we knew the problem was temperature related. Now any experienced electronics engineer will know that electrical signals take time to get from one place to another. It's called propagation delay, and these delays are normally measured in billionths of a second (nanoseconds). And propagation delays tend to increase with temperature. Like any CPU our MF1600 relies on signals getting to the right place at the right time. And if several signals have to reach the same place at the same time then even a small extra delay in one of them can cause major problems.

On most CPUs when each basic instruction is executed, a tiny program inside the CPU actually does the work of that instruction. Those tiny programs are called microcode. Here is a blog post from several years ago where I explain what microcode is. Microcode is magic stuff - it's the place where software and hardware meet. Just like any program microcode has to be written and debugged, but uniquely - when you write microcode - you have to take account of how long it takes to process and route signals and data across the CPU: 100ns from A to B; 120ns from C to D, and so on. So if the timing in any microcode is tight (i.e. only just allows for the normal delay and leaves no margin for error), it could result in that microcode program crashing at elevated temperatures.
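To make the idea of 'tight' timing concrete, here is a minimal sketch in Python. Every number in it is an assumption chosen for illustration (the nominal delays, the 100ns budget and the 0.3% per degree drift are not MF1600 figures), and the function names are hypothetical.

```python
# Illustrative model only: every number here is a made-up assumption,
# not a measurement from the MF1600.

def path_delay_ns(delay_at_25c_ns, temp_c, pct_per_degree=0.3):
    """Propagation delay that grows linearly with temperature above 25 C."""
    return delay_at_25c_ns * (1 + pct_per_degree / 100 * (temp_c - 25))

def meets_timing(delay_at_25c_ns, budget_ns, temp_c):
    """True if the signal arrives within the time the microcode step allows."""
    return path_delay_ns(delay_at_25c_ns, temp_c) <= budget_ns

# A 'tight' path: 95 ns nominal against a 100 ns budget.
tight_ok_cool = meets_timing(95, 100, 25)
tight_ok_hot = meets_timing(95, 100, 70)

# A path with more margin: 80 ns nominal against the same budget.
relaxed_ok_hot = meets_timing(80, 100, 70)
```

In this toy model the tight path works at room temperature and misses its deadline when hot, while the path with more margin still makes it. Real TTL delays do not drift in such a tidy linear way, but the shape of the problem is the same.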

So, we reckoned we had one, or possibly several, microcode programs in the MF1600 CPU with 'tight' timing. The question was, how to find them.

The MF1600 CPU had around 86 (Forth) instructions, and the timing bugs could be in any of them. Now testing microcode is very difficult, and the nature of the problem made the testing even harder. A timing problem at elevated temperatures means that testing the microcode by single-stepping the CPU clock and tracing the signals through the CPU with a logic analyser wouldn't help at all. We needed a way to efficiently identify the buggy instructions. Then we could worry about debugging them later. What we wanted was a way to test (i.e. exercise) single instructions, one by one, on a running system at high temperatures.

Then we remembered that we don't need all 86 instructions to run the computer. Most of them can be emulated by putting together a set of simpler instructions. So a strategy formed: (1) write a set of tiny Forth programs that replace as many of the CPU instructions as possible, (2) recompile the operating system, then (3) hope that the CPU runs ok at high temperature. If it does then (4) run the CPU in an oven and one by one test the replaced instructions.

Actually it didn't take long to do steps (1) and (2), because the Forth programs already existed to express more complex instructions as sets of simpler ones. Many Forth systems on conventional microprocessor systems were built like that. In the end we had a minimal set of about 24 instructions. So, with the operating system recompiled and installed we put the CPU into the oven and switched on the heat. The system ran perfectly (but a little slower than usual), and continued to run well above the temperature at which it had previously crashed. A real stroke of luck.

Here's an example of a simple Forth instruction to replace two values on the stack with the smaller of those values, expressed as a Forth program we call MIN:
: MIN  OVER OVER > IF SWAP THEN DROP ;
(From my 1983 book The Complete Forth).
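For readers who don't know Forth, here is a rough Python model of how that MIN definition works using only simpler stack operations. It is a sketch of the idea, not of the MF1600: the Python function names are hypothetical, the stack is an ordinary list, and the true flag is taken to be -1 (any non-zero value counts as true).

```python
# A toy Forth-style data stack in Python; an illustration, not the MF1600.

def over(s):
    s.append(s[-2])                     # ( a b -- a b a )

def swap(s):
    s[-1], s[-2] = s[-2], s[-1]         # ( a b -- b a )

def drop(s):
    s.pop()                             # ( a -- )

def greater(s):                         # ( a b -- flag )
    b, a = s.pop(), s.pop()
    s.append(-1 if a > b else 0)        # any non-zero flag counts as true

def forth_min(s):
    """Mirror of  : MIN  OVER OVER > IF SWAP THEN DROP ;"""
    over(s)
    over(s)
    greater(s)
    if s.pop() != 0:                    # IF consumes the flag
        swap(s)
    drop(s)

def run_min(a, b):
    """Push a then b, run MIN, and return what is left on the stack."""
    s = [a, b]
    forth_min(s)
    return s
```

Calling `run_min(7, 3)` walks the stack through `[7, 3, 7, 3]`, then the comparison, then a swap and a drop, leaving only the smaller value. The point is that MIN needs nothing beyond OVER, SWAP, DROP, a comparison and a conditional, which is why it could be removed from the set of instructions the operating system relied on.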

From then on it was relatively easy to run small test programs to exercise the other 62 instructions (which were of course still there in the CPU - just not used by the operating system). A couple of days' work and we found the two rogue instructions that were crashing at temperature. They were - as you might have expected - rather complex instructions. One was (LOOP), an instruction for do loops.
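The one-by-one search can be sketched roughly like this in Python. Everything here is invented for illustration: the instruction names, the test vectors, and especially the fault, which is simulated as a wrong result in MAX when 'hot' (on the real hardware the symptom was a crash, and the real test programs were written in Forth).

```python
# Hypothetical sketch: instruction names, the fault, and the test vectors are
# invented for illustration only.

def reference(name, a, b):
    """Trusted result, standing in for the emulated (primitive-only) version."""
    return {"MIN": min(a, b), "MAX": max(a, b), "+": a + b}[name]

def native(name, a, b, hot):
    """Stand-in for the microcoded instruction; MAX gets a heat-only fault."""
    if name == "MAX" and hot:
        return a          # simulated timing failure: the comparison is skipped
    return reference(name, a, b)

def find_rogue(names, vectors, hot):
    """Exercise each instruction alone and return those that ever disagree."""
    return [n for n in names
            if any(native(n, a, b, hot) != reference(n, a, b)
                   for a, b in vectors)]

VECTORS = [(1, 2), (2, 1), (-3, 3), (0, 0)]
rogue_cool = find_rogue(["MIN", "MAX", "+"], VECTORS, hot=False)
rogue_hot = find_rogue(["MIN", "MAX", "+"], VECTORS, hot=True)
```

With the simulated fault, nothing is flagged when cool and only MAX is flagged when hot. What made this approach workable in practice was having a system that stayed up at temperature, so each suspect instruction could be exercised in isolation.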

Then debugging those instructions simply required studying the microcode and the big chart with all the CPU delay times, over several pots of coffee. We knew (or strongly suspected) that what we were looking for were timing problems, called race hazards, where the data from one part of the CPU just doesn't have time to get to another part in time to be used for the next step of the microcode program. Having identified the suspect timing I then re-wrote the microcode for those instructions to leave a bit more time - by adding one clock cycle (50ns) to each instruction.
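As a back-of-envelope check on what one extra clock cycle costs, here is a small Python calculation. It uses only the two figures quoted in this post (a 50ns clock cycle and 6.6 million instructions per second) and assumes the latter is an average rate; the four-cycle instruction in the usage note is a hypothetical example, since the actual cycle counts of the two rogue instructions aren't given here.

```python
# Figures taken from the post: 50 ns clock period, 6.6 million instructions/s.
CLOCK_NS = 50
RATE_MIPS = 6.6

clock_mhz = 1000 / CLOCK_NS              # clock frequency in MHz
avg_instr_ns = 1000 / RATE_MIPS          # average time per instruction in ns
avg_cycles = avg_instr_ns / CLOCK_NS     # average clock cycles per instruction

def slowdown_pct(cycles_before, extra_cycles=1):
    """Percentage increase in execution time after adding extra cycles."""
    return 100 * extra_cycles / cycles_before
```

On these figures the clock runs at 20MHz and an average instruction takes about three cycles, so `slowdown_pct(4)` says a hypothetical four-cycle instruction would get 25% slower with one added cycle. That cost applies only to the patched instructions, not to the rest of the instruction set.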

Then, having reverted to the old non-patched operating system, we came to the moment of truth. Back in the oven, cranking up the temperature, while the CPU was running test programs specifically designed to stress those particular instructions. Yes! The system didn't crash at all, over several days of running at temperature. I recall pushing the temperature above 100°C. Components on the CPU circuit board were melting, but still it didn't crash.

So that's how we debugged code with an oven.

Saturday, April 21, 2012

What's wrong with Consumer Electronics?

When I was a boy the term consumer electronics didn't exist. Then the sum total of household electronics was a wireless, a radiogram and a telephone; pretty much everyone had a wireless, fewer a radiogram and on our (lower middle-class) street perhaps one in five houses had a telephone. (In an emergency it was normal to go round to the neighbour with the phone.) In the whole of my childhood we only ever had the same wireless set and gramophone and both looked more like furniture than electronics, housed in handsome polished wooden cabinets. Of course it was their inner workings, with the warm yellow glow of the thermionic valves that fascinated me and got me into trouble when I took them to pieces, that led to my chosen career in electronics.

How things have changed. Now most middle-class households have more computing power than existed in the world 50 years ago. Multiple TVs, mobile phones, computing devices (laptops, games consoles, iPads, Kindles and the like) and the supporting infrastructure of wireless routers, printer, and backup storage, are now normal. And most of this stuff will be less than five years old. If you're anything like me the Hi-Fi system will be the oldest bit of kit you own (unless you ditched it for the iPod and docking station). Of course this gear is wonderful. I often find myself shocked by the awesomeness of everyday technology. And understanding how it all works only serves to deepen my sense of awe. But, I'm also profoundly worried - and offended too - by the way we consume our electronics.

What offends me is this: modern solid-state electronics is unbelievably reliable - what's wrong with consumer electronics is nothing, yet we treat this magical stuff - fashioned of glass - as stuff to be consumed then thrown away. Think about the last time you replaced a gadget because the old one had worn out or become unrepairable. Hard, isn't it? If you still possessed it, the mobile phone you had 15 years ago would - I'd wager - still work perfectly. I have a cupboard here at home with all manner of obsolete kit. A dial-up modem for instance, circa 1993. It still works fine - but there's nothing to dial into. The fact is that we are compelled to replace perfectly good nearly-new electronics with the latest model either because the old stuff is rendered obsolete (because it's unsupported, or no longer compatible with the current generation of operating systems, applications or infrastructure), or worse still because the latest kit has 'must have' features or capabilities not present on the old.

I would like to see a shift in consumer electronics back to a model in which gadgets are designed to be repaired and consumers are encouraged to replace or upgrade every ten years or more, not every year. What I'm suggesting is of course exactly the opposite of what's happening now. Current devices are becoming less repairable, with batteries you can't replace and designs that even skilled technicians find difficult to take apart without risk of damage. The latest iPad for example was given a very low repairability score (2/10) by iFixit.

And the business model most electronics companies operate is fixated on the assumption that profit, and growth, can only be achieved through very short product life cycles. But not all of our stuff is like this. We don't treat our houses, or gardens, or dining room tables, or central heating systems, or any number of things as consumer goods, but the companies that build and sell houses, or dining room tables, or landscape gardens, etc, still turn a profit. Why can't electronics companies find a business model that treats electronic devices more like houses and less like breakfast cereal?

I don't think consumer electronics should be consumed at all.