热度 15
2015-8-21 18:20
1733 次阅读|
0 个评论
As a technology journalist, I get the equivalent of a postgraduate education reading good, detailed papers and reports, though some are mind-numbingly difficult. Knowing my interest in such papers, several software developers and tool vendors have independently referred me to "Quantitative Evaluation of Static Analysis Tools," by Shin’ichi Shirashi, Veena Mohan and Hemalatha Marimuthu, at the Toyota InfoTechnology Center in Mountain View, Ca. In this paper, the authors describe the task of selecting the optimum code analysis tool for doing run time code analysis of software for use in Toyota's vehicles. Starting with tools from about 170 vendors of proprietary tools as well as a range of free and open source versions, they narrowed their choices down to six, those from Coverity, GrammaTech, PRQA, MathWorks, Monoidics and Klocworks to make their selections they used a complex methodology to first make their assessment and from that also derive a set of coding guidelines to help the company's development teams avoid defects proactively. Readers of the report may disagree with the types of tests and metrics the Toyota team used to make its choices. Some developers I have talked to complain that despite the data-driven quantitative approach, at the beginning of their efforts the team made use of the qualitative and subjective judgements of a few experts they trusted. However given the number of such tools they had to evaluate, that was a choice they were almost forced to make. Even after limiting their choices in that way, there were many alternatives to evaluate and test, even after they narrowed their choices further by excluding noncommercial tools that provided no technical support, those that did not support safety critical applications, and including only those that supported the C language. The paper describes how the Toyota team first created a set of test suites incorporating a wide variety of software defects that might cause problems in safety critical applications. They then tested the tools they had selected against that suite. Finally, because of the importance of driving buggy software out of the automobile environment, they went one step further: they used the data already collected to create several new metrics to further hone down the performance of the tools. For information on the various tests they used, they depended on several reports from the U.S. National Institute of Standards and Technology (NIST), supplemented with information from various tool vendors and users. The choices they made and the process they came up with made it clear how many things can go wrong in a software design, especially a safety critical one, and how hard it is to pin things down. Drawing on every piece of literature they could find, they identified eight defect types that were important, including: static and dynamic memory, resource management, pointer-related, and concurrency defects as well as the use of inappropriate and dead code. From that they identified 39 defect sub-types, and from those they created 841 variations that they used in their tests. The methodology is about as comprehensive as any I have ever seen. In addition to the defect code bases they created for their test suites, the researchers created an identical set without defects. The first code base was for evaluating true positives from the analysis results, while the second set without defects was used for evaluating false positives. The researchers also tried to keep the test suites as simple as possible while keeping in mind the nature of the automotive software environment, especially in relation to the use of global variables and stack memory. (The test suite benchmarks are available on Github .) On average, the six static analysis tool finalists were correct in their detection of flaws in the code about 34 to 65 percent of the time, with Grammatech's Code Sonar ranking first. But a more important measure is how they ranked on various kinds of other tests. For example, on static memory defects, Gramatech's Code Sonar and the MathWorks Polyspace ranked highest. On pointer-related defects, PRQA ranked highest. On detecting concurrency defects, Code Sonar came out on top, while on numerical defects, Mathwork's tool did best. The report goes into much more detail, of course. Dry and matter-of-fact as it is, though, their report is worth going through slowly line by line because of its value to developers concerned about writing reliable, safe code. The paper is a gold mine of information that I will refer to over and over again, not only for insight into which tools are best, but for the nature of the defects being hunted and the impact they have on overall software reliability. It is also valuable if you are interested in learning how to develop a methodology for assessing the complex trade-offs that must be made. Perhaps the most valuable lesson to be learned from this paper is the clear inherent message that if you are serious about detecting software defects, more than one tool is necessary, no matter how much the design team manager or the company accountant complains about the expense.