This is the continuation of my article, "How I Test Software". We have a lot to cover this time, so let's get right to it. I'll begin with some topics that don't fit neatly into the mainstream flow.
Test whose software?
As with my previous article, "How I Test Software", I need to make clear that the chosen pronoun in the title is the right one. This is not a tutorial on software testing methodologies. I'm describing the way I test my software.
Does this mean that I only write software solo and for my own use? Not at all. I've worked on many large projects. But when I do, I try to carve out a piece that I can get my arms around and develop either alone or with one or two like-minded compadres. I've said before that I usually find myself working with folks that don't develop software the same way I do. I used to think I was all alone out here, but you readers disabused me of that notion. Even so, my chances of finding a whole project team who work the way I do are slim and none. Which is why I try to carve out a one-person-sized piece.
What's that you say? Your project is too complicated to be divided up? Have you never heard of modular programming? Information hiding? Refactoring? Show me a project whose software can't be divided up, and I'll show you a project that's doomed to failure.
The classical waterfall method has a box called "Code and Unit Test." If it makes you feel better, you can put my kind of testing into that pigeonhole. But it's more than that, really. There's a good reason why I want to test my software, my way: my standards are higher than "theirs."
Remember my discussion about quality? To some folks, the term "quality" has come to mean "just good enough to keep from getting sued." I expect more than that from my software. I don't think it's enough that it works well enough for the company to get their fee. I think it should not only run, but give the right answer.
I'm funny that way. It doesn't always endear me to my managers, who sometimes just want to ship anything, whether it works or not. Sometimes I want to keep testing when they wish I wouldn't. But regardless of the quality of the rest of the software, I try very hard to make mine bulletproof. Call it pride if you like or even hubris. I'm still going to test thoroughly.
Oddly enough, the reason I spend so much time testing is, I'm lazy. I truly hate to debug; I hate to single-step; I hate to run hand checks. But there's something I hate even more, and that's having to come back again, later, and do it all over again. The reason I test so thoroughly is, I don't ever want to come back this way again.
Desk checking
All my old software engineering textbooks used to say that the most effective method for reducing software errors was desk checking. Personally, I've never found it to be very effective.
I assume you know the theory of desk checking. When a given programmer is working on a block of code, he may have been looking at the code so long, he doesn't see the obvious errors staring him in the face. Forest for the trees, and all that stuff. A fresh pair of eyes might see those errors that the programmer has become blind to.
Maybe it's so, but I find that the folks that I'd trust to desk check my code, and do it diligently, are much too much in demand to have time to do it. And what good is it to lean on a lesser light? If I have to spend all day explaining the code to him, what good is that?
But there's another, even more obvious reason to skip desk checking: it's become an anachronism. Desk checking may have been an effective strategy in one of those projects where computer time was so precious, no one was allowed to actually compile their code. But it makes no sense at all, today.
Face it, unless the checker is a savant who can solve complicated math equations in his head, he's not going to know whether the math is being done right or not. At best, he can only find syntax errors and typos. But today, there's a much more effective way to find those errors: let the compiler find them for you. When it does, you'll find yourself back in the editor with the cursor blinking at the location of the error. And it will do it really fast. With tools like that, who needs the fellow in the cubie next door?
Single stepping
I have to admit it: I'm a single-stepping junkie. When I'm testing my software, I tend to spend a lot of time in the source-level debugger. I will single-step every single line of code. I will check the result of every single computation. If I have to come back later and test again, I will, many times, do the single-stepping thing all over again.
My pal Jim Adams says I'm silly to do that. If a certain statement did the arithmetic correctly the first time, he argues, it's not going to do it wrong later. He's willing to grant me the license to single step one time through the code, but never thereafter.
I suppose he's right, but I still like to do it anyway. It's sort of like compiling the null program. I know it's going to work (at least, it had better!), but I like to do it anyway, because it puts me in the frame of mind to expect success rather than failure. Not all issues associated with software development are cold, hard, rational facts. Some things revolve around one's mindset. When I single-step into the code, I find myself reassured that the software is still working, the CPU still remembers how to add two numbers, and the compiler hasn't suddenly started generating bad instructions.
Hey, remember: I've worked with embedded systems for a long time. I've seen CPUs that didn't execute their instructions right. I've seen some that did for a time, but then stopped doing it right. I've seen EPROM-based programs whose bits of 1s and 0s seemed to include a few ½s.
I find that I work more effectively once I've verified that the universe didn't somehow become broken when I wasn't looking. If that takes a little extra time, so be it.
Random-number testing
This point is one dear to my heart. In the past, I've worked with folks who like to test their software by giving it random inputs. The idea is, you set the unit under test (UUT) up inside a giant loop and call it a few million times, with input numbers generated randomly (though perhaps limited to an acceptable range). If the software doesn't crash, they argue, it proves that it's correct.
No it doesn't. It only proves that if the software can run without crashing once, it can do it a million times. But did it get the right answer? How can you know? Unless you're going to try to tell me that your test driver also computes the expected output variables, and verifies them, you haven't proven anything. You've only wasted a lot of clock cycles.
Anyhow, why the randomness? Do you imagine that the correctness of your module on a given call depends on what inputs it had on the other 999,999 calls? If you want to test the UUT with a range of inputs, you can do that. But why not just do them in sequence? As in 0.001, 0.002, 0.003, etc.? Does anyone seriously believe that shuffling the order of the inputs is going to make the test more rigorous?
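To make that concrete, here's a minimal sketch in C++ of the kind of driver I have in mind. The unit under test, sine_approx(), is just a hypothetical stand-in (a truncated Taylor series); what matters is the shape of the loop: step through the input range in sequence, and check every single output against an independently computed expected value.

#include <cmath>
#include <cstdio>

// Hypothetical stand-in for the unit under test: a truncated Taylor
// series for sin(x), good to a few parts in a million on [0, 1].
double sine_approx(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}

int main()
{
    const double lo = 0.0, hi = 1.0, step = 0.001;
    const double tol = 1.0e-5;     // matched to the accuracy of this crude UUT
    int failures = 0;

    for (double x = lo; x <= hi + 0.5 * step; x += step) {
        double expected = std::sin(x);      // independent reference value
        double got      = sine_approx(x);
        if (std::fabs(got - expected) > tol) {
            std::printf("FAIL at x = %g: got %.15e, expected %.15e\n",
                        x, got, expected);
            ++failures;
        }
    }
    std::printf("%d failure(s) over the sweep\n", failures);
    return failures ? 1 : 0;
}

Notice that the driver does the one thing the random-input loop never does: it computes the expected answer for every call and compares.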
There's another reason it doesn't work. As most of us know, if a given software module is going to fail, it will fail at the boundaries. Some internal value is equal to zero, for example.
But zero is the one value you will never get from a random-number generator (RNG). Most such generators work by doing integer arithmetic that overflows and then keeping only the lower-order bits. The power residue method, for example, multiplies the last integer by some magic number and takes the lower-order bits of the result. In such an RNG, you will never see an integer value of zero, because if the state ever became zero, it would stay zero for the rest of the run.
Careful design of the RNG can eliminate this problem, but don't count on it. Try it yourself: run the RNG in your favorite compiler and see how often you get a floating-point value of exactly 0.000000000000e+000.
End result? The one set of values that are most likely to cause trouble—that is, the values on the boundaries—is the one set you can never have.
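If you'd like to see it for yourself, here's a toy power-residue generator, using the classic Park and Miller constants, that illustrates the point. The integer state lives in the range 1 to m-1; if it ever became zero it would stay zero, so a properly seeded generator never gets there, and the normalized output is never exactly 0.0.

#include <cstdint>
#include <cstdio>

static uint32_t state = 1;                  // seed must be nonzero

double lehmer_rand()
{
    const uint64_t a = 16807;               // multiplier
    const uint64_t m = 2147483647;          // modulus, 2^31 - 1
    state = (uint32_t)((a * state) % m);    // if state were 0, it would stay 0
    return (double)state / (double)m;       // lies in (0, 1): never exactly 0.0
}

int main()
{
    int zeros = 0;
    for (long i = 0; i < 10000000; ++i)
        if (lehmer_rand() == 0.0)
            ++zeros;
    std::printf("exact zeros in ten million draws: %d\n", zeros);
    return 0;
}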
If you're determined to test software over a range of inputs, there's a much better way to do that, which I'll show you in a moment.
Validation?
All my career, I've had people tell me their software was "validated." On rare occasions, it actually is, but usually not. In my experience, the claim is often used to intimidate. It's like saying, "Who do you think you are, to question my software, when the entire U.S. Army has already signed off on it?"
Usually, the real situation is, the fellow showed his Army customer one of his outputs, and the customer said, "Yep, it looks good to me." I've worked with many "validated" systems like that and found hundreds of egregious errors in them.
Even so, software absolutely can be validated and should be. But trust me, it's not easy.
A lot of my projects involve dynamic simulations. I'm currently developing a simulator for a trip to the Moon. How do I validate this program? I sure can't actually launch a spacecraft and verify that it goes where the program said it would.
There's only one obvious way to validate such software, and that's to compare its results with a different simulation, already known to be validated itself. Now, when I say compare the results, I mean really compare them. They should agree exactly, to within the limits of floating-point arithmetic. If they don't agree, we have to go digging to find out why not.
In the case of simulations, the reasons can be subtle. Like slightly different values for the mass of the Earth. Or its J2 and J4 gravity coefficients. Or the mass of its atmosphere. Or the mass ratio of the Earth/Moon system. You can't even begin to run your validation tests until you've made sure all those values are identical. But if your gold standard is someone else's proprietary simulation, good luck getting them to tell you the constants they use.
Even if all the constants are exactly equal and all the initial conditions are identical, it's still not always easy to tell that the simulations agree. It's not enough to just look at the end conditions. Two simulations only agree if their computed trajectory has the same state values, all the way from initial to final time.
Now, a dynamic simulation will typically output its state at selected time steps—but selected by the numerical integrator way down inside, not by the user. If the two simulations are putting out their data at different time hacks, how can we be sure they're agreeing?
Of course, anything is possible if you work at it hard enough, and this kind of validation can be done. But, as I said, it's not easy, and not for the faint-hearted.
Fortunately, there's an easier and more elegant way. The trick is to separate the numerical integration scheme from the rest of the simulation. You can test the numerical integrator on differential equations that have known, closed-form solutions. If it reproduces those solutions, within the limits of floating-point arithmetic, it must be working properly.
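Here's a sketch of what such a test might look like, assuming a textbook fourth-order Runge-Kutta stepper as a stand-in for the integrator. The test equation dy/dt = -y, with y(0) = 1, has the known solution y(t) = exp(-t), so the integrated result can be compared against the exact answer at every step.

#include <cmath>
#include <cstdio>

// One fourth-order Runge-Kutta step for a scalar ODE y' = f(t, y).
double rk4_step(double (*f)(double, double), double t, double y, double h)
{
    double k1 = f(t, y);
    double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
    double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
    double k4 = f(t + h,       y + h * k3);
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
}

double decay(double, double y) { return -y; }   // test equation: dy/dt = -y

int main()
{
    const double h = 0.01;
    const int    steps = 1000;                  // integrate from t = 0 to t = 10
    double y = 1.0;                             // y(0) = 1
    double worst = 0.0;

    for (int i = 0; i < steps; ++i) {
        y = rk4_step(decay, i * h, y, h);
        double t = (i + 1) * h;
        double err = std::fabs(y - std::exp(-t));   // compare to exact solution
        if (err > worst) worst = err;
    }
    std::printf("worst error over [0, 10]: %.3e\n", worst);
    // For a fourth-order method, halving h should cut this error by about 16x.
    return 0;
}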
Now look at the model being simulated. Instead of comparing it to someone else's model, compare its physics with the real world. Make sure that, given a certain state of the system, you can calculate the derivatives of the states and verify that they're what the physics says they should be.
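As an illustration, here's what that check might look like for the simplest possible force model, point-mass gravity about the Earth. The routine deriv() is a stand-in for whatever the real simulation uses; the check is simply that the acceleration it returns has the magnitude mu/r^2 the physics demands, and points back toward the central body.

#include <cmath>
#include <cstdio>

const double MU_EARTH = 3.986004418e14;      // GM of the Earth, m^3/s^2

struct State { double r[3]; double v[3]; };  // position (m), velocity (m/s)
struct Deriv { double rdot[3]; double vdot[3]; };

// Stand-in for the simulation's derivative routine: point-mass gravity only.
Deriv deriv(const State& s)
{
    Deriv d;
    double rmag = std::sqrt(s.r[0]*s.r[0] + s.r[1]*s.r[1] + s.r[2]*s.r[2]);
    double k = -MU_EARTH / (rmag * rmag * rmag);
    for (int i = 0; i < 3; ++i) {
        d.rdot[i] = s.v[i];                  // d(position)/dt = velocity
        d.vdot[i] = k * s.r[i];              // d(velocity)/dt = acceleration
    }
    return d;
}

int main()
{
    // A roughly circular orbit at 7000 km radius, as a spot check.
    State s = { {7.0e6, 0.0, 0.0}, {0.0, 7546.0, 0.0} };
    Deriv d = deriv(s);

    double rmag = 7.0e6;
    double a_physics = MU_EARTH / (rmag * rmag);        // what physics says
    double a_model = std::sqrt(d.vdot[0]*d.vdot[0] +
                               d.vdot[1]*d.vdot[1] +
                               d.vdot[2]*d.vdot[2]);

    std::printf("|a| from model:   %.9e m/s^2\n", a_model);
    std::printf("|a| from physics: %.9e m/s^2\n", a_physics);
    return 0;
}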
Now you're done, because if you've already verified the integrator, you can be sure that it will integrate the equations of motion properly and get you to the right end point.
Is this true validation? I honestly don't know, but it works for me.
[Continued at How I test software, again and again (Part 2)]