The pursuit of protection: How much testing is “enough”?

Jim BirdMay 24th, 2012Last Updated: October 22nd, 2012

0 26 9 minutes read

I’m definitely not a testing expert. I’m a manager who wants to know when the software that we are building is finished, safe and ready to ship.

Large-scale enterprise systems – the kinds of systems that I work on – are essentially hard to test. They have lots of rules and exceptions and lots of interfaces and lots of customization for different customers and partners, and lots of operational dependencies, and they deal with lots of data. We can’t test everything – there are tens of thousands or hundreds of thousands of different scenarios and different paths to follow.

This gets easier and harder if you are working in Agile methods, building and releasing small pieces of work at a time. Most changes or new features are easy enough to understand and test by themselves. The bigger problem is in understanding the impact of each change on the rest of the system that has already been built, what side-effects the change may have, what might have broke. This gets harder if a change is introduced in small steps over several releases, so that some parts are incomplete or even invisible to the test team for a while.

People who write flight control software or medical device controllers need to do exhaustive testing, but the rest of us can’t afford to, and there are clearly diminishing returns. So if you can’t or aren’t going to “test everything”, how do you know when you’re done testing? One answer is that you’re done testing when you run out of time to do any more testing. But that’s not good enough.

You’re done testing when your testers say they’re done

Another answer is that you’re done when the test team says they’re done. When all of the static analysis findings have been reviewed and corrected. When all of the automated tests pass. When the testers have made sure that all features that are supposed to be complete were completed and secure, finished their test checklists, made sure that the software is usable and checked for fit-and-finish, tested for performance and stability, made sure that the deployment and rollback steps work, and completed enough exploratory testing that they’ve stopped finding interesting bugs, and the bugs that they have found (the important ones at least) have all been fixed and re-checked.

This of course assumes that they tested the right things – that they understood the business requirements and priorities, and found most of the interesting and important bugs in the system. But how do you know that they’ve done a good job?

What a lot of testers do is black box testing, which falls into two different forms:

Scripted functional and acceptance testing, manual and automated – how good the testing is depends on how complete and clear the requirements are (which is a challenge for small Agile teams working through informal requirements that keep changing), and how much time the testers have to plan out and run their tests.
Unscripted behavioural or exploratory manual testing – depends on the experience and skill of the tester, and on their familiarity with the system and their understanding of the domain.

With black box testing, you have to trust in the capabilities and care of the people doing the testing work. Even if they have taken a structured, methodical approach to defining and running tests they are still going to miss something. The question is – what, and how much?

Using Code Coverage

To know when you’ve tested enough, you have to stop testing in the dark. You have to look inside the code, using white box structural testing techniques to understand what code has been tested, and then look closer at the code to figure out how to test the code that wasn’t.

A study at Microsoft over 5 years involving thousands of testers found that with scripted, structured functional testing, testers could cover as much as 83% of the code. With exploratory testing they could raise this a few percentage points, to as high as 86%. Then, by looking at code coverage and walking through what was tested and what wasn’t, they were able to come up with tests that brought coverage up above 90%.

Using code coverage this way, instrumenting code under test and then looking into the code and reviewing and improving the tests that have already been written and figuring out what new tests to write, needs testers and developers to work together even more closely.

How much code coverage is enough?

If you’re measuring code coverage, the question that comes up is how much coverage is enough? What percentage of your code should be covered before you can ship? 100%? 90%? 80%? You will find a lot of different numbers in the literature and I have yet to find solid evidence showing that any given number is better than another. Cedric Beust, Breaking Away from the Unit Test Group Think

In Continuous Delivery, Jez Humble and David Farley set 80% coverage as a target for each of automated unit testing, functional testing and acceptance testing. Based on their experience, this should provide comprehensive testing.

Some TDD and XP advocates argue for 100% automated test coverage, which is a target to aim for if you are starting off from scratch and want to maintain high standards, especially for smaller systems. But 100% is unnecessarily expensive, and it’s a hopeless target for a large legacy system that doesn’t have extensive automated tests already in place. You’ll reach a point of diminishing returns as you continue to add tests, where each tests costs more to write and finds less. The more tests that you write, the more tests will be bad tests – duplicate tests that seem to test different things but don’t, tests that don’t test anything important (even if they help make the code coverage numbers look a little better), tests that don’t work but look like they do. All of these tests, good or bad, need to run continuously and need to be maintained and get in the way of making changes. The costs keep going up. How many shops can afford to achieve this level of coverage, and sustain it over a long period of time, or even want to?

Making Code Coverage Work for You

On the team that I manage now, we rely on automated unit and functional testing at around 70% (statement) coverage – higher in high-risk areas, lower in others. Obviously, automated coverage is also higher in areas that are easier to test with automated tools. We hit this level of coverage more than 3 years ago and it has held steady since then. There hasn’t been a good reason to push it higher – it gives us enough of a safety net for developers to make most changes safely, and it frees the test team up to focus on risks and exceptions.

Of course with the other kinds of testing that we do, manual functional testing and exploratory testing and multi-player war games, semi-automated integration testing and performance testing, and operational system testing, coverage in the end is much higher than 70% for each release. We’ve instrumented some of our manual testing work, to see what code we are covering in our smoke tests and integration testing and exploratory testing work, but it hasn’t been practical so far to instrument all of the testing to get a final sum in a release.

Defect Density, Defect Seeding and Capture/Recapture – Does anybody really do this?

In an article in IEEE Software Best Practices from 1997, Steve McConnell talks about using statistical defect data to understand when you have done enough testing.

The first approach is to use Defect Density data (# of defects per KLOC or some other common definition of size) from previous releases of the system, or even other systems that you have worked on. Add up how many defects were found in testing (assuming that you track this data – some Lean/Agile teams don’t, we do) and how many were found in production. Then measure the size of the change set for each of these releases to calculate the defect density. Do the same for the release that you are working on now, and compare the results. Assuming that your development approach hasn’t changed significantly, you should be able to predict how many more bugs still need to be found and fixed. The more data, of course, the better your predictions.

Defect Seeding, also known as bebugging,is where someone inserts bugs on purpose and then you see how many of these bugs are found by other people in reviews and testing. The percentage of the known [seeded] bugs not found gives an indication of the real bugs that remain. Apparently some teams at IBM, HP and Motorola have used Defect Seeding, and it must come up a lot in interviews for software testing labs (Google “What is Defect Seeding?”), but it doesn’t look like a practical or safe way to estimate test coverage. First, you need to know that you’ve seeded the “right” kind of bugs, across enough of the code to be representative – you have to be good at making bugs on purpose, which isn’t as easy as it sounds. If you do a Mickey Mouse job of seeding the defects and make them too easy to find, you will get a false sense of confidence in your reviews and testing – if the team finds most or all of the seeded bugs, that doesn’t mean that they’ve found most or all of the real bugs. Bugs tend to be simple and obvious, or subtle and hard to find, and bugs tend to cluster in code that was badly designed or badly written, so the seeded bugs need to somehow represent this. And I don’t like the idea of putting bugs into code on purpose. As McConnell points out, you have to be careful in removing the seeded bugs and then do still more testing to make sure that you didn’t break anything.

And finally, there is Capture/Re-Capture, an approach used to estimate wildlife populations (catch and tag fish in a lake, then see how many of the tagged fish you catch again later), which Watts Humphrey introduced to software engineering as part of TSP to estimate remaining defects from the results of testing or reviews. According to Michael Howard, this approach is sometimes used at Microsoft for security code reviews, so let’s explore this context. You have two reviewers. Both review the same code for the same kinds of problems. Add up the number of problems found by the first reviewer (A), the number found by the second reviewer (B), and separately count the common problems that both reviewers found, where they overlap (C). The total number of estimated defects: A*B/C. The total number of defects found: A+B-C. The total number of defects remaining: A*B/C – (A+B-C).

Using Michael Howard’s example, if Reviewer A found 10 problems, and Reviewer B found 12 problems, and 4 of these problems were found by both reviewers in common, the total number of estimated defects is 10*12/4=30. The total number of defects found so far: 18. So there are 12 more defects still to be found.

I’m not a statistician either, so this seems like magic to me, and to others. But like the other statistical techniques, I don’t see it scaling down effectively. You need enough people doing enough work over enough time to get useful stats. It works better for large teams working in Waterfall-style, with a long test-and-fix cycle before release. With a small number of people working in small, incremental batches, you get too much variability – a good reviewer or tester could find most or all of the problems that the other reviewers or testers found. But this doesn’t mean that you’ve found all of the bugs in the system.

Your testing is good enough until a problem shows that it is not good enough

In the end, as Martin Fowler points out, you won’t really know if your testing was good enough until you see what happens in production: The reason, of course, why people focus on coverage numbers is because they want to know if they are testing enough. Certainly low coverage numbers, say below half, are a sign of trouble. But high numbers don’t necessarily mean much, and lead to “ignorance-promoting dashboards”. Sufficiency of testing is a much more complicated attribute than coverage can answer. I would say you are doing enough testing if the following is true:

You rarely get bugs that escape into production, and
You are rarely hesitant to change some code for fear it will cause production bugs.

Test everything that you can afford to. Release the code. When problems happen in production, fix them, then use Root Cause Analysis to find out why they happened and to figure out how you’re going to prevent problems in the future, how to improve the code and how to improve the way you write it and how you test it. Keep learning and keep going.

Reference: The pursuit of protection: How much testing is “enough”? from our JCG partner Jim Bird at the Building Real Software blog.