Don't assume anything (Part 2)


[Continued from Don't assume anything (Part 1)]

 

The Y2K problem taught us only one of two very important lessons in writing code: memory is cheap, so never clip the leading digits off the year. (Perhaps we haven't even learned that bit of pedagogy. The Unix Y2.038K bug still looms. I'm reassured that day will come long after senility likely claims me, but I'm equally sure that in 1970 the two-digit-year crowd adopted the same "heck, it's a long way off" viewpoint. It's interesting that we rail against CEOs' short-term-gain philosophy, yet practice it ourselves. In both cases, the rewards are structured toward today rather than tomorrow.)
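
To make the rollover concrete, here's a minimal C sketch (mine, not anyone's production code) of why a signed 32-bit count of seconds since the Unix epoch gives up in 2038:

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch of the Y2038 rollover: a signed 32-bit count of
   seconds since the Unix epoch runs out on 2038-01-19. The wrap is
   done through unsigned arithmetic to keep the demo well-defined. */
int main(void)
{
    int32_t unix_seconds = INT32_MAX;          /* 2038-01-19 03:14:07 UTC */
    printf("last good second: %ld\n", (long)unix_seconds);

    unix_seconds = (int32_t)((uint32_t)unix_seconds + 1u);
    printf("one tick later:   %ld\n", (long)unix_seconds);
    /* -2147483648, i.e., back in 1901: exactly the kind of "crazy
       number" a sanity check should refuse to pass along */
    return 0;
}
```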


Y2K did not teach us what is one of the first lessons in computer science: check your goesintas and goesoutas. If the date rolls back to a crazy number, well, something is wrong. Toss an exception, recompute, alert the user, do something! Even without complex problems like reentrancy, lots of conditions can create crazy results. Divide by zero in floating point and C will happily return a result. The number will be complete garbage, but, if not caught, it will invariably be manipulated again and again by higher-level functions. It's like replacing a variable with a random number. Perhaps deep down inside an A/D driver a bit of normalization math experiences an overflow or a divide by zero. The error propagates up the call chain, at each stage being further injected, inspected, detected, infected, neglected, and selected (as Arlo Guthrie would say). The completely bogus result might indicate how much morphine to push into the patient or how fast to set the cruise control.
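
The defensive version takes only a few lines of C. This is a sketch under assumed names and limits (a hypothetical 12-bit converter; adc_normalize is my invention, not any real driver's API):

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical A/D normalization sketch: check the goesintas and the
   goesoutas instead of letting garbage climb the call chain. */
#define ADC_MAX_COUNTS  4095.0   /* 12-bit converter, assumed */

bool adc_normalize(double counts, double full_scale, double *result)
{
    /* goesintas: reject a zero or negative scale before it becomes a
       divide-by-zero, and counts outside the converter's range */
    if (full_scale <= 0.0 || counts < 0.0 || counts > ADC_MAX_COUNTS)
        return false;

    double value = counts / full_scale;

    /* goesoutas: a normalized reading must be a real number in [0, 1];
       anything else is garbage, so say so instead of returning it */
    if (!isfinite(value) || value < 0.0 || value > 1.0)
        return false;

    *result = value;
    return true;   /* the caller must check, not assume */
}
```

A caller that gets false back can recompute, substitute a safe default, or alert the user; what it can't do is quietly feed the morphine pump.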


Sure, these are safety-critical applications that might (emphasis on "might") have mitigation strategies to prevent such an occurrence, though Google returns 168,000 results for the keywords "FDA recall infusion pumps." But none of us can afford to tick off the customer in this hypercompetitive world.


My files are rife with photos of systems gone bad because developers didn't look for silly results. There are dozens of pictures of big digital clocks adorning banks and other buildings showing readings like "505 degrees" or "-196 degrees." One would think anything over 150 or so is probably not right. And the coldest temperature ever recorded in nature on Earth is -129 degrees Fahrenheit, at the Soviet Antarctic research station Vostok, which is not a bad lower limit for a sanity check.
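
Such a sanity check costs almost nothing. Here's a hedged sketch where the limits come straight from the paragraph above and the function name is my invention:

```c
#include <stdbool.h>

/* Display sanity check using the bounds argued above: nothing in
   nature below about -129 F (Vostok), and on an outdoor sign anything
   over 150 F is suspect. The limits are judgment calls, not physics. */
#define TEMP_MIN_F  (-129.0)
#define TEMP_MAX_F  ( 150.0)

static bool temp_is_sane(double temp_f)
{
    return temp_f >= TEMP_MIN_F && temp_f <= TEMP_MAX_F;
}
```

A bank-sign loop that ran its reading through something like this could blank the display or flag a fault instead of advertising "505 degrees" to the whole town.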


Then there's the system that claimed power had been out for -470,423 hours, -21 minutes, and -16 seconds. A year ago Yahoo showed the Nasdaq crashing -16,000,000%, though (take heart!) the S&P 500 was up 19%. The Xdrive app showed -3GB of total disk space, but happily it let the user know 52GB was free. Ad-Aware complained its definitions were 730,488 days out of date (Jesus was just a lad then). While trying to find out if the snow would keep me grounded, Southwest's site indicated my flight out of Baltimore was canceled but, mirabile dictu, would land on time in Chicago. Parking in the city is getting expensive; I have a picture of a meter with an $8 million tab. There's the NY Times' website, always scooping other newspapers, showing a breaking story released -1,283 minutes ago. A reader sent me a scan of his Fuddrucker's receipt: the bill was $5.94, he tendered $6.00, and the change was $5.99999999999996E-02. That's pretty darn close to the six cents due, but completely absurd.
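
That receipt is classic binary floating point: neither 5.94 nor 0.06 has an exact binary representation, so naive subtraction followed by full-precision printing leaks the error onto paper. A minimal C sketch (my reconstruction; the actual point-of-sale code is unknown) shows the bug and two standard fixes:

```c
#include <stdio.h>
#include <math.h>

/* Reconstruction of the receipt bug: 5.94 and 0.06 cannot be stored
   exactly as binary doubles, so the raw difference carries an error. */
int main(void)
{
    double bill = 5.94, tendered = 6.00;

    /* prints 0.0599999999999996: the receipt's 5.99999999999996E-02 */
    printf("%.15G\n", tendered - bill);

    /* fix 1: round to cents at display time */
    printf("$%.2f\n", tendered - bill);                  /* $0.06 */

    /* fix 2 (better): carry money as integer cents throughout */
    long change = lround(tendered * 100.0) - lround(bill * 100.0);
    printf("$%ld.%02ld\n", change / 100, change % 100);  /* $0.06 */
    return 0;
}
```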


This is a very small sample of the inane results in my files of dumb computer errors. All could have been prevented had developers adopted the mind-set that their assumptions are likely wrong, chief among them the notion that everything will be perfectly fine.


None of these examples endangered anyone; it's easy to dismiss them as non-problems. But one can be sure the customers were puzzled at best. Surely the subliminal takeaway is "this company is clueless." I do think we have polarized the industry into safety-critical and non-safety-critical camps, when a third category is hugely important: mission critical. Correct receipts are mission-critical at Fuddrucker's, and in the other examples I've cited, "mission critical" includes not making the organization look like an utter fool.


The quote at the beginning of this article is the engineer's credo: we assume things will go wrong, in varied and complex ways. We identify every rosy assumption and demand that it be justified. We do worst-case analysis to find the problem areas, and improve the design so the system is robust despite inevitable problems. All engineered systems have flaws; great engineering mitigates those flaws.


Software is topsy-turvy. A single error can wreak havoc. A mechanical engineer can beef up a beam to add margin; EEs use fuses and heavier wire to deal with unexpected stresses. The beam may bend without breaking; the wire might get hot but continue to function. Get one bit wrong in a program encompassing 100 million bits and the system may crash. That's 99.999999% correctness, a thousand times better than the Five Nines requirement and about two orders of magnitude better than Motorola's vaunted Six Sigma initiative.


I'll conclude with a quote from the final report of the board that investigated the failure of Ariane 5 Flight 501. The inertial navigation system experienced an overflow (due to poor assumptions) and the exception handler (poorly engineered and tested) didn't take corrective action. "The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault."


Ah, if only the assumption that software is correct were true!


 
