
Faster Builds vs Stable Builds

We have a very comprehensive set of tests on our Java server application. There’s a mix of quick unit tests and slower integration-style tests. Some of the latter test the behaviour of the plain HTML pages by executing requests through HtmlUnit, which simulates the browser against a fake back end. Others use Docker to simulate services like MySQL or Redis.
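
To make that concrete, here’s a minimal sketch of what one of the HtmlUnit-style tests might look like, assuming JUnit 5 and an HtmlUnit 2.x dependency – the fixture name, local port and page text are illustrative assumptions rather than our real suite:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.assertTrue;

    class LoginPageTest {
        @Test
        void loginPageRenders() throws Exception {
            // WebClient is AutoCloseable, so try-with-resources tidies it up
            try (WebClient webClient = new WebClient()) {
                // Drive the page as a browser would, against the fake back end
                HtmlPage page = webClient.getPage("http://localhost:8080/login");
                assertTrue(page.asNormalizedText().contains("Log in"));
            }
        }
    }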

Individually no test takes more than a few seconds to run. Most of them take milliseconds.

However, there’s a problem. When you want to run the entire test pack, there’s a law-of-physics problem: it genuinely takes a lot of compute resource. A build plus the whole test pack, run sequentially, takes more than 10 minutes of decent CPU time.

What if we disabled the slow tests? Well, sure, but we’d still want to run them somewhere in the pipeline, and the latency to deploy is also a concern.

What if we disabled them for certain builds? Then we’d be delaying any issues they find (and they do find issues) until too late.
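
For reference, the mechanics of that option are simple on JUnit 5: tag the slow fixtures and have certain builds filter the tag out, e.g. mvn test -DexcludedGroups=slow under Surefire. The tag name and fixture below are hypothetical:

    import org.junit.jupiter.api.Tag;
    import org.junit.jupiter.api.Test;

    // Tagged so that a fast build can exclude it – at the cost of delayed feedback
    @Tag("slow")
    class RedisIntegrationTest {
        @Test
        void roundTripsAValue() {
            // ... Docker-backed test body elided ...
        }
    }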

So, we decided to beef up the spec of the build server and attempt to parallelise the tests.

This leads to an important test design point:

Tests should be independent enough that they can be safely parallelised.

With a small number of exceptions this was true of every test case, but we decided to parallelise at the level of the test fixture, and the test fixtures were definitely entirely thread safe with respect to one another.
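
If, like ours, the suite runs on JUnit 5, fixture-level parallelism can be expressed roughly as below. The engine-wide switches live in src/test/resources/junit-platform.properties, and the fixture shown is a hypothetical example:

    // junit-platform.properties:
    //   junit.jupiter.execution.parallel.enabled=true
    //   junit.jupiter.execution.parallel.mode.default=same_thread
    //   junit.jupiter.execution.parallel.mode.classes.default=concurrent
    // i.e. fixtures (classes) run concurrently, while the tests inside
    // each fixture still run sequentially on a single thread.

    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.parallel.Execution;
    import org.junit.jupiter.api.parallel.ExecutionMode;

    // A fixture can also opt in (or out) explicitly, per class
    @Execution(ExecutionMode.CONCURRENT)
    class CustomerPageTest {
        @Test
        void showsCustomerDetails() {
            // ... test body elided ...
        }
    }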

However, there’s another law-of-physics problem with parallelising tests on the same machine. Call it the second law of test optimisation:

As we try to make the build faster with parallelisation, it fights back in terms of stability.

In reality, the more tricks you pull to make fuller use of the machine you’re running on, the more the tests start to flicker owing to timing issues, often ones that are beyond your control.

As an example, the HtmlUnit virtual browser behaves in a slightly flaky way when the machine is running slowly: it hits timing issues loading and processing JavaScript files, and occasionally refuses to believe that certain libraries have loaded.
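
One mitigation is to wait explicitly for background JavaScript to settle rather than trusting fixed delays on a loaded machine. A minimal sketch, where the helper and its 10-second timeout are illustrative assumptions:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    class Pages {
        // Load a page and give pending JavaScript jobs time to settle
        static HtmlPage loadSettled(WebClient webClient, String url) throws Exception {
            HtmlPage page = webClient.getPage(url);
            // Blocks until background JS jobs finish or the timeout elapses;
            // returns how many jobs are still outstanding
            webClient.waitForBackgroundJavaScript(10_000);
            return page;
        }
    }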

Theoretically, the machine running tests in parallel (and in fact running neighbouring services’ builds in parallel) can encounter memory issues as well as CPU issues as it’s hammered.

Depending on how deterministically the parallelisation divides the work among the CPUs, there may be occasional builds where things are more chaotic and go wrong, and others where everything runs smoothly.

The alternative would be to parallelise across different build machines. However, this has two overheads that make it unattractive:

  • Orchestrating the fragments of the build
  • Paying for the spin-up time of each build worker multiple times over

The best solution I’ve found is to find a workable balance between:

  • Amount of parallelisation – down-tuning this can actually make things faster (see the sketch after this list)
  • Amount of resources available to the build
  • Amount of flakiness we can tolerate
  • Fine-tuning tests to be naturally faster, or to parallelise better – dividing fixtures more evenly, for example
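
As an illustration of down-tuning, JUnit 5’s parallel worker pool can be capped at a fixed size rather than left to the default dynamic strategy – a hedged sketch in junit-platform.properties, where the value 4 is an arbitrary example rather than a recommendation:

    # Cap the parallel worker pool explicitly instead of sizing it
    # dynamically from the number of available processors
    junit.jupiter.execution.parallel.config.strategy=fixed
    junit.jupiter.execution.parallel.config.fixed.parallelism=4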

We really want to be able to express our tests and services in their most natural form, but occasionally real-world limitations step in, and we need to make things more sympathetic to the build.

I got the build down to about 7 minutes… it won’t go any faster, and as the app grows, that number rises again.

Published on Java Code Geeks with permission by Ashley Frieze, partner at our JCG program. See the original article here: Faster Builds vs Stable Builds

Opinions expressed by Java Code Geeks contributors are their own.

Ashley Frieze

Software developer, stand-up comedian, musician, writer, jolly big cheer-monkey, skeptical thinker, Doctor Who fan, lover of fine sounds