Security Engineering: What Works?
18 December 2014
Last week I had the opportunity to attend a joint US National Academy of Sciences and UK Royal Society event on cyber-security in Washington DC. One of the speakers, a true expert whom I respect very much, described how they envision building (more) secure systems, and others in the audience offered their opinions (the Chatham House Rule prevents me from disclosing names). The debate was of high quality; however, it did strike me that it remained at the level of expert opinion. After 40 years of research in cyber-security, should we not be doing better than expert opinion when it comes to understanding how to engineer secure systems?
First, let me say that I have a great appreciation for craftsmanship and the deep insights that come from years of practice. Therefore, when someone with experience tells me to follow a certain course of action to engineer a system, in the absence of any other evidence, I do listen carefully. However, expert opinion is only one form of evidence, and in some respects the weakest, in what researchers in other disciplines have defined as a hierarchy of evidence. Stronger forms of evidence include case studies, case-control and cohort studies, double-blind studies with good sample sizes and significant results, and systematic meta-analyses and reviews.
In security engineering we have quite a few case reports, particularly relating to specific failures, in the form of design flaws and implementation bugs. We also have a set of methodologies as well as techniques and tools that are meant to help with security engineering. Which work, and at what cost? How do they compare with each other? What are the non-security risks (cost, complexity, training, planning) associated with them? There is remarkably little evidence on which to decide, beyond expert opinion at best and flaming at worst. This is particularly surprising, since a number of very skilled people have spent considerable time advocating for their favorite engineering paradigms in the name of security: static analysis, penetration testing, code reviews, strong typing, security testing, secure design and implementation methodologies, verification, pair-coding, use of specific frameworks, etc. However, besides opinion, it is hard to find much evidence of how well these work in reducing security problems.
I performed a quick literature survey, which I add here for my own future benefit:
Wagner, S., Jürjens, J., Koller, C., & Trischberger, P. (2005). Comparing bug finding tools with reviews and tests. In Testing of Communicating Systems (pp. 40-55). Springer Berlin Heidelberg. (Cites: 87)
This work compares testing, bug finding tools and code reviews. The study was not specific to security bugs. Its key finding is that “bug finding tools predominantly find different defects than testing but a subset of defects found by reviews.” In addition to this, the authors note that bug finding tools have a high number of false positives, and that the tools show different results in different projects. So the authors conclude that: “Dynamic tests or reviews cannot be substituted by bug finding tools because they find significantly more and different defect types”; “Bug finding tools can be a good pre-stage to a review because some of the defects do not have to be manually detected. A possibility would be to mark problematic code so that it cannot be overlooked in the review.” It is interesting to note (Table 6) that bug finding tools catch bugs related to the quality of the code, tests catch bugs related to the function of the code, and reviews catch a bit of both. Quality of evidence: used production-level software and human reviewers to establish findings.
Palsberg, J. (2001, June). Type-based analysis and applications. In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering (pp. 20-27). ACM. (Cites: 54)
This paper provides an introduction to type-based static analysis, and compares it with other techniques for catching bugs. It offers no empirical evidence using real programs, real developers, real reviewers, etc.
Zitser, M., Lippmann, R., & Leek, T. (2004, October). Testing static analysis tools using exploitable buffer overflows from open source code. In ACM SIGSOFT Software Engineering Notes (Vol. 29, No. 6, pp. 97-106). ACM. (Cites: 204)
A paper of historic value, looking at the effectiveness of bug finding tools for detecting buffer overflows as they stood over 10 years ago (Archer, Boon, PolySpace, Splint, Uno). Its key finding is that only one of them (PolySpace) had a performance significantly better than random. The evidence was gathered by introducing known bugs into real programs and using the tools to detect them (but no real reviewers, real devs or real projects).
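To make "significantly better than random" concrete: a tool that flags programs at random raises alarms on patched code about as often as it detects real overflows. The following is a minimal sketch of one way to check this, using a one-sided Fisher exact test. It is not the procedure used in the paper, and the function name and all counts are hypothetical, chosen only for illustration.

```python
from math import comb


def p_value_better_than_random(hits, n_vulnerable, false_alarms, n_patched):
    """One-sided Fisher exact test (illustrative helper, not from the paper).

    Returns the probability of seeing at least `hits` detections among the
    vulnerable programs if the tool's alarms were unrelated to whether a
    program is actually vulnerable.
    """
    total = n_vulnerable + n_patched
    flagged = hits + false_alarms
    # Sum the hypergeometric tail: k detections out of `flagged` alarms.
    favourable = sum(
        comb(n_vulnerable, k) * comb(n_patched, flagged - k)
        for k in range(hits, min(n_vulnerable, flagged) + 1)
    )
    return favourable / comb(total, flagged)


# Hypothetical counts: 14 vulnerable and 14 patched test programs.
p_discriminating = p_value_better_than_random(12, 14, 2, 14)  # detects 12, 2 false alarms
p_coin_flip = p_value_better_than_random(7, 14, 7, 14)        # detects 7, 7 false alarms

print(f"discriminating tool: p = {p_discriminating:.6f}")
print(f"coin-flip tool:      p = {p_coin_flip:.3f}")
```

A small p-value suggests the tool separates vulnerable from patched code; a tool with a high detection rate and an equally high false alarm rate would not pass such a check.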
Ayewah, N., Pugh, W., Morgenthaler, J. D., Penix, J., & Zhou, Y. (2007, June). Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering (pp. 1-8). ACM. (Cites: 114)
Reports on experiences applying a static analysis tool (FindBugs) to Sun’s JRE 6, Sun’s JEE and internal Google code bases. At the time of writing, any Google code that was modified was run through FindBugs, and two engineers would review the warnings to see which should be reported to the developers (after some cost-benefit analysis). Interesting finding: the older the code a defect was found in, the less likely it was to be fixed (known obsolete code, original dev has moved on, code more trusted and tested due to age). On the Google code base, out of 1127 high/medium priority warnings, 518 were fixed (i.e. acted upon) and 289 were still open.
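As a quick sanity check on these figures, the shares can be worked out directly. This is only arithmetic on the three numbers quoted above; how the remaining warnings were resolved is not broken down in this summary, so they are simply labeled "other" here.

```python
# Figures quoted above for the Google code base.
total_warnings = 1127
fixed = 518
still_open = 289

# Warnings that are neither fixed nor open; their outcome is not
# broken down in the summary above, so we only count them.
other = total_warnings - fixed - still_open


def share(count, total=total_warnings):
    """Fraction of all high/medium priority warnings."""
    return count / total


print(f"fixed:      {fixed:4d}  ({share(fixed):.1%})")
print(f"still open: {still_open:4d}  ({share(still_open):.1%})")
print(f"other:      {other:4d}  ({share(other):.1%})")
```

So a little under half of the reviewed warnings were acted upon, and roughly a quarter were still open.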
Wagner, S. (2006, September). A literature survey of the quality economics of defect-detection techniques. In Proceedings of the 2006 ACM/IEEE international symposium on Empirical software engineering (pp. 194-203). ACM. (Cites: 31)
This paper tries to build an economic model of different defect detection techniques. The defects are not specific to security. Conclusions include “test techniques tend to be more efficient in defect detection having lesser difficulties but to have larger removal costs”; […] “during unit tests fault removal is considerably cheaper than during system tests. This suggests that unit-testing is very cost-efficient.”
Boogerd, C., & Moonen, L. (2009, May). Evaluating the relation between coding standard violations and faults within and across software versions. In Mining Software Repositories, 2009. MSR’09. 6th IEEE International Working Conference on (pp. 41-50). IEEE. (Cites: 30)
This work looks at whether coding standard violations are good predictors of software defects. Overall, releases with a higher density of violations were not found to be more error prone. They found some small evidence that violating 10 specific rules at the module or file level might be associated with more faults (with caveats); a similar result was observed for specific lines of code where these selected rules were violated. So overall only “10 out of the 88 rules significantly surpassed the measured fault injection rate”. However, the authors warn that these rules may be different for different projects. Real projects were used in the analysis.
Antunes, N., & Vieira, M. (2009, November). Comparing the effectiveness of penetration testing and static code analysis on the detection of sql injection vulnerabilities in web services. In Dependable Computing, 2009. PRDC’09. 15th IEEE Pacific Rim International Symposium on (pp. 301-306). IEEE. (Cites: 33)
This paper looks specifically at SQL injection bugs, and how penetration testing tools and static code analysis tools compare in finding them. Overall they find that static analysis tools find more bugs than penetration testing tools, but at the cost of a high number of false positives. Different tools also seem to find different bugs. The programs analyzed seem pretty small (100-300 lines), and only penetration testing tools (black box), not actual pen testers, were tried.
Elberzhager, F., Münch, J., & Nha, V. T. N. (2012). A systematic mapping study on the combination of static and dynamic quality assurance techniques. Information and Software Technology, 54(1), 1-15. (Cites: 28)
This is a meta-study of articles on static and dynamic analysis. Its conclusions and findings concern the community of practice, but not the substantive issue of what works.
Wagner, S., & Seifert, T. (2005). Software quality economics for defect-detection techniques using failure prediction. ACM SIGSOFT Software Engineering Notes, 30(4), 1-6. (Cites: 20)
This focuses on a model of defects and economics with only one (small) application.
Finifter, M., & Wagner, D. (2011, June). Exploring the relationship between Web application development tools and security. In USENIX conference on Web application development. (Cites: 16)
This paper looks at how web devs should choose development and analysis tools, and the consequences for security. Findings are: no relation between choice of programming language and application security; automated protection mechanisms, such as those for CSRF and session management, are effective, whereas manual ones provide little value; manual review is more effective than black-box testing, but the two are complementary. Evidence is based on 8 apps written in 3 languages (Perl, Java, PHP) using various frameworks; one reviewer; black-box testing using Burp Suite. There is an interesting set of future research hypotheses at the end.
Ayewah, N., & Pugh, W. (2010, July). The google findbugs fixit. In Proceedings of the 19th international symposium on Software testing and analysis (pp. 241-252). ACM. (Cites: 39)
This paper discusses the large-scale application of the FindBugs static analysis tool on the whole Google code base in 2009. Interestingly, most reports, while needing some fix, were not serious, indicating that serious problems were caught before static analysis as part of normal development, testing and deployment. There is some anecdotal evidence that static analysis may have achieved this at lower cost.
Austin, A., & Williams, L. (2011, September). One technique is not enough: A comparison of vulnerability discovery techniques. In Empirical Software Engineering and Measurement (ESEM), 2011 International Symposium on (pp. 97-106). IEEE. (Cites: 36)
This work compares techniques for finding security bugs (systematic and exploratory manual pen testing, static analysis, and automated pen testing) on two electronic health record systems. They report that no single technique found all vulnerabilities, but that manual reviews found most design flaws while static analysis found most implementation bugs. The most vulnerabilities per unit of time were discovered by automated pen testing, but they recommend that manual pen testing also be used to find design problems. Evidence is based on two large Java and PHP programs (100Ks of lines of code).
Baca, D., Petersen, K., Carlsson, B., & Lundberg, L. (2009, March). Static Code Analysis to Detect Software Security Vulnerabilities-Does Experience Matter?. In Availability, Reliability and Security, 2009. ARES’09. International Conference on (pp. 804-810). IEEE. (Cites: 24)
This paper looks at whether devs can correctly interpret the output of static analysis tools to identify security problems and fix them. They find that: “average developers do not correctly identify the security warnings”; “only developers with specific experiences are better than chance in detecting the security vulnerabilities”. “Specific SAT experience more than doubled the number of correct answers and a combination of security experience and SAT experience almost tripled the number of correct security answers.”
Ayewah, N., & Pugh, W. (2008, July). A report on a survey and study of static analysis users. In Proceedings of the 2008 workshop on Defects in large software systems (pp. 1-5). ACM. (Cites: 22)
This paper looks at how users (devs) use the FindBugs tool. They conclude that users are quicker at reviewing reports from FindBugs compared with Fortify SCA. Method: survey of devs.
Johnson, B., Song, Y., Murphy-Hill, E., & Bowdidge, R. (2013, May). Why don’t software developers use static analysis tools to find bugs?. In Software Engineering (ICSE), 2013 35th International Conference on (pp. 672-681). IEEE. (Cites: 25)
This paper notes that static analysis tools are underused. Through interviews with 20 devs, they identify (by self-reporting) that false positives and warning presentation were barriers to adoption. Interesting qualitative results; no observational study.
Smith, B., & Williams, L. (2012, June). On the Effective Use of Security Test Patterns. In Software Security and Reliability (SERE), 2012 IEEE Sixth International Conference on (pp. 108-117). IEEE. (Cites: 7)
This paper evaluates the technique of “Security Test Patterns”, which are meant to allow novice testers with no experience of security to make test plans incorporating security. Evidence: they used 47 students as novice programmers, and compared their test plans with those of experts, concluding that the two are similar when the novices are informed by the test patterns.
Finifter, M., Akhawe, D., & Wagner, D. (2013, August). An Empirical Study of Vulnerability Rewards Programs. In USENIX Security (Vol. 13). (Cites: 7)
This paper reviews the Chrome and Firefox vulnerability reward programs, using real data from 3 years (and over $1M of rewards). These led to about 1/4 of patched vulnerabilities during this period. The authors conclude that the programs are economically efficient compared with hiring in-house security talent. The recipients of these awards cannot make a living on them. There are some case-control based conclusions about the difference in prevalence of bugs between the two browsers, which the authors ascribe to architecture, and they offer a number of other hypotheses / conjectures to be tested in the future.
Edmundson, A., Holtkamp, B., Rivera, E., Finifter, M., Mettler, A., & Wagner, D. (2013). An empirical study on the effectiveness of security code review. In Engineering Secure Software and Systems (pp. 197-212). Springer Berlin Heidelberg. (Cites: 5)
This work looks at the effectiveness of human code reviewers at finding security-related bugs, and its variation. Evidence is based on 30 devs who were hired to perform a manual code review of a web-app. They find: no one found all bugs; experience is not a good predictor of being more accurate or effective; higher reporting leads to both higher true positives and false positives. Looking at fig. 3, the latter is not clear to me (due to linear model use). There is also evidence of the Dunning-Kruger effect when it comes to self-declared experience versus effectiveness (fig. 5). I conclude that the oDesk market may also not provide trustworthy signals of expertise.
Zhao, M., Grossklags, J., & Chen, K. (2014, November). An Exploratory Study of White Hat Behaviors in a Web Vulnerability Disclosure Program. In Proceedings of the 2014 ACM Workshop on Security Information Workers (pp. 51-58). ACM. (Cites: 0)
This paper studies the behavior of white-hat bug finders using a dataset from Wooyun (China). They conclude that diversity of contributors is a key indicator of the quality and breadth of bugs found. They recommend that vulnerability reward programs foster this diversity by providing rewards to a wider variety of contributors, instead of focusing only on the top ones.
Lipner, S. (2004, December). The trustworthy computing security development lifecycle. In Computer Security Applications Conference, 2004. 20th Annual (pp. 2-13). IEEE.
This paper introduces the Microsoft SDL, along with evidence that it produces more secure software, based on counting security bugs in the first year of release of Win 2K versus Win Server 2003. This can be considered a couple of important case reports. It is important to note that there is a reduction, but no elimination, of security bugs.
Manadhata, P. K., & Wing, J. M. (2011). An attack surface metric. Software Engineering, IEEE Transactions on, 37(3), 371-386. (Cites: 158)
This work proposes a metric for attack surface. The tool is tested on real applications, but no results on effectiveness in predicting security defects or experiences with devs are reported.
Williams, L., Kudrjavets, G., & Nagappan, N. (2009, November). On the effectiveness of unit test automation at microsoft. In Software Reliability Engineering, 2009. ISSRE’09. 20th International Symposium on (pp. 81-89). IEEE. (Cites: 27)
This paper looks at the cost-effectiveness of automated unit testing. The evidence used is from a 32-person real-world development team. It is reported that the introduction of automated, versus ad-hoc, unit testing led to a 20% reduction in defects at a 30% increase in dev time (relative to the previous version, not a control). The authors note that other teams experienced greater defect decreases when tests were written iteratively, rather than after a development phase. Three case studies previously reported a decrease in defects of 62% – 91% using test driven development. Interesting qualitative conclusions are presented in Section 5.
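Whether a 20% reduction in defects is worth a 30% increase in dev time depends on what a defect costs downstream. The back-of-the-envelope sketch below computes the break-even cost per defect; only the two percentages come from the paper as summarized above, while the function name, the effort figure and the defect count are hypothetical assumptions for illustration.

```python
def break_even_cost_per_defect(dev_effort, baseline_defects,
                               time_increase=0.30, defect_reduction=0.20):
    """Cost per defect above which the extra testing effort pays for itself.

    dev_effort and the returned value share the same unit (e.g. person-days).
    The default percentages are the ones reported above; everything else is
    an input the reader must supply.
    """
    extra_effort = dev_effort * time_increase
    defects_avoided = baseline_defects * defect_reduction
    if defects_avoided <= 0:
        raise ValueError("no defects avoided, so there is no break-even point")
    return extra_effort / defects_avoided


# Hypothetical project: 1000 person-days of development, 500 defects
# expected without automated unit tests.
threshold = break_even_cost_per_defect(dev_effort=1000.0, baseline_defects=500)
print(f"Unit testing pays off if a defect costs more than {threshold:.1f} person-days")
```

Under these made-up inputs the extra 300 person-days avoid 100 defects, so automated unit testing breaks even when the average defect costs 3 person-days to find and fix later.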
Bounimova, E., Godefroid, P., & Molnar, D. (2013, May). Billions and billions of constraints: Whitebox fuzz testing in production. In Proceedings of the 2013 International Conference on Software Engineering (pp. 122-131). IEEE Press. (Cites: 31)
This paper reports on using a white-box fuzzer in a production environment (at Microsoft) and experiences scaling the technology up to larger applications. This is a technology driven paper, and does not report at length on the economics or effectiveness of the technology in reducing security defects.
Baca, D., Carlsson, B., Petersen, K., & Lundberg, L. (2013). Improving software security with static automated code analysis in an industry setting. Software: Practice and Experience, 43(3), 259-279. (Cites: 6)
This is the report of an industrial case study applying static analysis. They report that correcting faults was easier than categorizing warnings. Most faults detected are related to memory management, and few others were detected. The correction of false positives created new vulnerabilities. They propose that two devs categorize the warnings, and decide how to act on them.