What’s better – Big Fat Tests or Little Tests?

Jim BirdAugust 27th, 2012Last Updated: October 22nd, 2012

0 35 6 minutes read

Like most startups, we built a lot of prototypes and wrote and threw out a lot of code as we tried out different ideas. Because we were throwing out the code anyways, we didn’t bother writing tests – why write tests that you’ll just throw away too?

But as we ramped the team up to build the prototype out into a working system, we got into trouble early. We were pushing our small test team too hard trying to keep up with changes and new features, while still trying to make sure that the core system was working properly. We needed to get a good automated test capability in place fast.

The quickest way to do this was by writing what Michael Feathers calls “Characterization Tests”: automated tests – written at inflection points in an existing code base – that capture the behavior of parts of a system, so that you know if you’ve affected existing behavior when you change or fix something. Once you’ve reviewed these tests to make sure that what the system is doing is actually what it is supposed to be doing, the tests become an effective regression tool.

The tests that we wrote to do this are bigger and broader than unit tests – they’re fat developer-facing tests that run beneath the UI and validate a business function or a business rule involving one or more system components or subsystems. Unlike customer-facing functional tests, they don’t require manual setup or verification. Most of these tests are positive, happy path tests that make sure that important functions in the system are working properly, and that test validation functions.

Using fat and happy tests as a starting point for test automation is described in the Continuous Delivery book. The idea is to automate high-value high-risk test scenarios that cover as much of the important parts of the system as you can with a small number of tests. This gives you a “smoke test” to start, and the core of a test suite.

Today we have thousands of automated tests that run in our Continuous Integration environment. Developers write small unit tests, especially in new parts of the code and where we need to test through a lot of different logical paths and variations quickly. But a big part of our automated tests are still fat, or at least chubby, functional component tests and linked integration tests that explore different paths through the main parts of the system.

We use code coverage analysis to identify weak spots, areas where we need to add more automated tests or do more manual testing. Using a combination of unit tests and component tests we get high (90%+) test coverage in core parts of the application, and we exercise a lot of the general plumbing of the system regularly.

It’s easy to test server-side services this way, using a common pattern: set up initial state in a database or memory, perform some action using a message or API call, verify the expected results (including messages and database changes and in-memory state) and then roll-back state and prepare for the next test.

We also have hundreds of much bigger and fatter integration and acceptance tests that test client UI functions and client API functions through to the server. These “really big fat” tests involve a lot more setup work and have more moving parts, are harder to write and require more maintenance, and take longer to run. They are also more fragile and need to be changed more often. But they test real end-to-end scenarios that can catch real problems like intermittent system race conditions as well as regressions.

What’s good and bad about fat tests?

There are advantages and disadvantages in relying on fat tests.

First, bigger tests have more dependencies. They need more setup work and more test infrastructure, they have more steps, and they take longer to run than unit tests. You need to take time to design a test approach and to create templates and utilities to make it easy to write and maintain bigger tests.

You’ll end up with more waste and overlap: common code that gets exercised over and over, just like in the real world. You’ll have to put in better hardware to run the tests, and testing pipelines so that more expensive testing (like the really fat integration and acceptance testing) is done later and less often.

Feedback from big tests isn’t as fast or as direct when tests fail. Gerard Meszaros points out that the bigger the test, the harder is to understand what actually broke – you know that there is a real problem, but you have more digging to figure out where the problem is. Feedback to the developer is less immediate: bigger tests run slower than small tests and you have more debugging work to do. We’ve done a lot of work on providing contextual information when tests fail so that programmers can move faster to figuring out what’s broken. And from a regression test standpoint, it’s usually obvious that whatever broke the system is whatever you just changed, so….

As you work more on a large system, it is less important to get immediate and local feedback on the change that you just made and more important to make sure that you didn’t break something else somewhere else, that you didn’t make an incorrect assumption or break a contract of some kind, or introduce a side-effect. Big component tests and interaction tests help catch important problems faster. They tell you more about the state of the system, how healthy it is. You can have a lot of small unit tests that are passing, but that won’t give you as much confidence as a smaller number of fat tests that tell you that the core functions of the system are working correctly.

Bigger tests also tell you more about what the system does and how it works. I don’t buy the idea that tests make for good documentation of a system – at least unit tests don’t. It’s unrealistic to expect a developer to pick up how a system works from looking at hundreds or thousands of unit tests. But new people joining a team can look at functional tests to understand the important functions of the system and what the rules of the system are. And testers, even non-technical manual testers, can read the tests and understand what tests scenarios are covered and what aren’t, and use this to guide their own testing and review work.

Meszaros also explains that good automated developer tests, even tests at the class or method level, should always be black box tests, so that if you need to change the implementation in refactoring or for optimization, you can do this without breaking a lot of tests. Fat tests make these black boxes bigger, raising it to a component or service level. This makes it even easier to change implementation details without having to fix tests – as long as you don’t change public interfaces and public behavior, (which are dangerous changes to make anyways), the tests will still run fine.

But this also means that you can make mistakes in implementation that won’t be caught by functional tests – behavior outside of the box hasn’t changed, but something inside the box might still be wrong, a mistake that won’t trip you up until later. Fat tests won’t find these kinds of mistakes, and they won’t catch other detailed mistakes like missing some validation.

It’s harder to write negative tests and to test error handling code this way, because the internal exception paths are often blocked at a higher level. You’ll need other kinds of testing, including unit tests and manual exploratory testing and destructive testing to check edge cases and catch problems in exception handling.

Would we do it this way again?

I’d like to think that if we started something brand new again, we’d start off in a more disciplined way, test first and all that. But I can’t promise. When you are trying to get to the right idea as quickly as possible, anything that gets in the way and slows down thinking and feedback is going to be put aside. It’s once you’ve got something that is close-to-right and close-to-working and you need to make sure that it keeps working, that testing becomes an imperative.

You need both small unit tests and chubby functional tests and some big fat integration and end-to-end tests to do a proper job of automated testing. It’s not an either/or argument.

But writing fat, functional and interaction tests will pay back faster in the short-term, because you can cover more of the important scenarios faster with fewer tests. And they pay back over time in regression, because you always know that aren’t breaking anything important, and you know that you are exercising the paths and scenarios that your customers are or will be – the paths and scenarios that should be tested all of the time. When it comes to automated testing, some extra fat is a good thing.

Reference: What’s better – Big Fat Tests or Little Tests? from our JCG partner Jim Bird at the Building Real Software blog.