2011-6-29 11:39
Almost a decade ago I wrote a series of three articles about watchdog timers (WDTs). At the time, I argued that most WDTs were poorly designed. Too many still are. [1]

I won't repeat those arguments here since they're available online. The core point, though, is that a WDT is the last line of defense against product failure. Designed correctly, it guarantees that the system will recover from a crash; anything less may result in a ticked-off customer. Or the loss of an expensive mission. Or, for unfortunate users, injury or death.

Remember that when a program crashes, it generally runs off and starts executing code from some random address. Rarely does the application actually stop; if it does, that's usually only after it has executed a lot of incorrect instructions.

So what's the state of the art today? A complete survey is impossible, but here are a few data points.

Texas Instruments' MSP430 family comprises a wide range of nifty, very low-power 16-bit microcontrollers. The documentation shows a very impressive-looking WDT block diagram (see Figure 13-1 in TI's PDF), but the reality is less thrilling. [2] At any time, the code can turn the protection mechanism off, which means a crashed program running rogue code can issue an instruction that disables the WDT. The system crashes and never recovers.

The MSP430's instruction set is refreshingly simple, generally using a single word to represent, for instance, a MOV instruction. A nice feature is that to change any WDT setting one uses a MOV of a 16-bit value whose upper byte is 0x5a; the lower 8 bits contain the command. Try to write with anything else in the upper byte and the system will reset. The lower bits carry configuration information that disables or enables the WDT, selects the clock source, or even switches the WDT to act as a simple timer.

What are the odds that crashed code will write exactly 0x5a in the upper byte? Pretty slim, of course. But not zero, and probably a good deal higher than one might guess.
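To make the mechanism concrete, here is a minimal C model of the password check just described. It is a sketch, not TI's implementation: the function and constant names are made up for illustration, and it assumes the hold (disable) bit is bit 7 of the low byte, which is how TI's headers define WDTHOLD. It counts how many of the 65,536 possible 16-bit values would quietly stop the watchdog rather than trigger a reset.

```c
#include <stdint.h>

#define WDT_PASSWORD 0x5Au  /* required upper byte of every write */
#define WDT_HOLD_BIT 0x80u  /* assumed: bit 7 of the low byte stops the WDT */

typedef enum { WDT_RUNNING, WDT_STOPPED, WDT_RESET } wdt_result;

/* Model one 16-bit write to the watchdog control register. */
wdt_result wdt_write(uint16_t value)
{
    if ((value >> 8) != WDT_PASSWORD)
        return WDT_RESET;               /* wrong password: chip resets */
    return (value & WDT_HOLD_BIT) ? WDT_STOPPED : WDT_RUNNING;
}

/* Count the 16-bit values that disable the watchdog without a reset. */
unsigned count_stopping_values(void)
{
    unsigned count = 0;
    for (uint32_t v = 0; v <= 0xFFFFu; v++) {
        if (wdt_write((uint16_t)v) == WDT_STOPPED)
            count++;
    }
    return count;
}
```

Under those assumptions `count_stopping_values()` returns 128: the password fixes the upper byte, and half of the 256 possible low bytes have the hold bit set. That is about one random 16-bit value in 512, before considering how values are actually distributed in real code.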
A lot of MOV instructions will be in the code, and some will be followed by an ADD, which is what 0x5a encodes. Much better would be a configuration that allows access to the WDT settings only once. Or perhaps only once after resuming from a low-power mode.

Then there's Freescale's newish 32-bit ColdFire+ line, like the MCF51Qx. [3] Instead of "watchdog," Freescale prefers the awkward phrase "Computer Operating Properly" (COP). But it does offer a very intriguing feature. In general, one pets the watchdog, uh, COP, by writing 0x55 and then 0xaa to the control register. But in one mode, that sequence must be sent in the last 25% of the COP timeout period. A premature write results in a reset. The odds of an errant program getting the timing Goldilocks-correct (not too often, nor too infrequently) are tiny.

The part also generates a reset on any attempt to execute an illegal instruction. That's somewhat different from most CPUs, which raise an illegal op-code interrupt. I rather like Freescale's approach, since interrupt handlers are not guaranteed to work once the code has crashed: the stack may be blown, the PC may be corrupt (on some CPUs an odd PC causes a fault), or the vector base register may have changed. This also suggests that it's a good idea to fill unused flash at link time with an illegal op code, and on power-up to fill all of RAM with a similar instruction, so that errant code waltzing through memory is likely to generate a reset.

Another nice touch is that the reset pin is open drain and is asserted when any of these errors occurs. Tie it to the peripherals' reset inputs. Even if wandering code issues output instructions, the peripherals' potentially scrambled little brains will be straightened out.

STMicroelectronics has a line of Cortex-M3 devices. The M3 has become extremely popular for lower-end embedded devices, and ST's STM32F is representative of these parts (although the WDT is an ST add-on and does not necessarily mirror other vendors' implementations).
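Stepping back to the COP's windowed mode for a moment, the rule is easy to model in software. The sketch below is a hypothetical simulation, not Freescale's hardware or register interface: the type and function names are invented, time is counted in abstract ticks, and the 0x55/0xaa write sequence is collapsed into a single service call. It shows only the timing rule: service in the last quarter of the period or get reset.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t timeout_ticks;  /* full watchdog period */
    uint32_t elapsed_ticks;  /* ticks since the last accepted service */
    bool     reset;          /* latched once the watchdog has fired */
} window_wdt;

void wwdt_init(window_wdt *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->elapsed_ticks = 0;
    w->reset = false;
}

/* Advance time by one tick; reaching the timeout forces a reset. */
void wwdt_tick(window_wdt *w)
{
    if (w->reset)
        return;
    if (++w->elapsed_ticks >= w->timeout_ticks)
        w->reset = true;
}

/* Service attempt: accepted only in the last 25% of the period. */
void wwdt_service(window_wdt *w)
{
    if (w->reset)
        return;
    uint32_t window_opens = w->timeout_ticks - w->timeout_ticks / 4;
    if (w->elapsed_ticks < window_opens)
        w->reset = true;        /* too early: treated as a fault */
    else
        w->elapsed_ticks = 0;   /* in the window: restart the period */
}
```

With a 100-tick period, a service call at tick 80 is accepted, while one at tick 10 sets `reset`. A crashed program that hammers the service routine in a tight loop fails the second test almost immediately.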
The STM32F has two different protection mechanisms. An "Independent Watchdog" is a pretty vanilla design that has little going for it other than ease of use.

But the Window Watchdog offers more robust protection. When a countdown timer expires, a reset is generated, which can be prevented by reloading the timer. Nothing special there. But if the reload happens too quickly, the system will also reset. In this case "too quickly" is determined by a value one programs into a control register.

Another cool feature: it can generate an interrupt just before resetting. Write a bit of code to snag the interrupt and you can take some action to, for instance, put the system in a safe state or snapshot data for debugging purposes.

ST suggests using the interrupt service routine (ISR) to reload the watchdog, that is, to kick the dog so a reset does not occur. Don't take their advice. If the program crashes, the interrupt handlers may very well continue to function normally, so the dog keeps getting kicked while the application is dead. And using an ISR to reload the WDT defeats the entire purpose of a window watchdog.

The WDT cannot be disabled once enabled. Good thinking, folks! But oddly, the other configuration registers can be changed at will, which can make the watchdog behave incorrectly.

A novel approach

The latest issue of IEEE Embedded Systems Letters has an article that practically grabbed me by the throat. [4] Titled "Control Focused Soft Error Detection for Embedded Applications," by Karthik Shankar and Roman Lysecky, it's a bit academic and rather a slog to get through. But the authors have come up with a fascinating twist on the concept of watchdog timers. In fact, they don't use the word "watchdog," and no timers are involved.

The idea is quite simple and at first glance not particularly novel: monitor the addresses the processor issues and compare those against a profile obtained during development. If the CPU goes to an unexpected location, take some sort of remedial action.
Do the same if it doesn't issue an expected address. The authors go further and also check things like the expected number of loop iterations.

But what got my attention is how they monitor addresses. They simply suck in compressed trace data from the processor's serial trace data port. The paper talks about using an ARM CPU, but other parts also support various kinds of serial trace data, and there's even a standard named Nexus-5001, formally IEEE-ISTO 5001-2003. [5]
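The checking side of this idea can be illustrated in a few lines of C. What follows is my own simplified sketch, not the algorithm from the paper: it assumes the trace has already been decompressed into a plain array of addresses, and every name in it (`profile`, `check_trace`, the verdict codes) is hypothetical. It flags an address outside the profile, a required address that never shows up, and a loop head visited too few or too many times.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TRACE_OK,
    TRACE_UNEXPECTED_ADDRESS,  /* CPU went somewhere not in the profile */
    TRACE_MISSING_ADDRESS,     /* an expected address never appeared */
    TRACE_BAD_LOOP_COUNT       /* loop ran too few or too many times */
} trace_verdict;

typedef struct {
    const uint32_t *allowed;   /* every address seen during profiling */
    size_t n_allowed;
    const uint32_t *required;  /* addresses that must appear */
    size_t n_required;
    uint32_t loop_head;        /* address marking one loop iteration */
    unsigned min_iters, max_iters;
} profile;

static bool contains(const uint32_t *set, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == addr)
            return true;
    return false;
}

trace_verdict check_trace(const profile *p, const uint32_t *trace, size_t n)
{
    unsigned iters = 0;

    for (size_t i = 0; i < n; i++) {
        if (!contains(p->allowed, p->n_allowed, trace[i]))
            return TRACE_UNEXPECTED_ADDRESS;
        if (trace[i] == p->loop_head)
            iters++;
    }
    for (size_t i = 0; i < p->n_required; i++)
        if (!contains(trace, n, p->required[i]))
            return TRACE_MISSING_ADDRESS;
    if (iters < p->min_iters || iters > p->max_iters)
        return TRACE_BAD_LOOP_COUNT;
    return TRACE_OK;
}
```

A linear search keeps the sketch short; a real monitor working at trace-port speeds would need something much faster, and would most likely live in hardware rather than in the code it is watching.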