How I broke our continuous deployment

This post is about a failure – more precisely, about how I managed to bring our release process to its knees. I recommend reading on, especially if you are planning to ruin your release train any time soon. Following in my footsteps is a darn good way to bring down your automated processes for weeks.

But let me start with a rough description of how we release new Plumbr versions to end users. For this we have created two environments – production and test – each on dedicated servers. In both environments, LiveRebel orchestrates the deployment of the versions built by our Continuous Integration server. The test environment is updated after each CI run; the production environment is updated daily.

On the 21st of January we stopped the nightly production updates, as one of the larger changes needed more time in QA. The intention was to pause the updates for just a few days. At the same time, the test environment was still being updated several times a day.

As you might guess, those few days turned into ten. So it was not until the 30th of January that we were able to re-enable the automatic updates in the production environment – only to discover the next morning that the update had failed during application initialization with the following error message:

java.security.ProviderException: Could not initialize NSS

As we were using LiveRebel, the update was rolled back automatically and the production site kept running the version deployed ten days earlier. So no major harm was done, especially after I discovered that a manual push to production worked just fine.

But the automated update failed again the following night. And the next. So it was already the 5th of February when I finally found time to investigate the issue more thoroughly.

Thirty minutes of googling revealed only one clue – in the form of a discussion thread about libnss3 configuration errors. Making the proposed changes to nss.cfg indeed seemed to work, and the automated releases started rolling out again.
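For context, a commonly reported variant of this fix looks like the sketch below. The exact change depends on your distribution – the file path is the Ubuntu/Debian OpenJDK 7 location, and the library directory shown is an assumption for a 64-bit multiarch machine, so verify both against your own system before applying:

```
# /etc/java-7-openjdk/security/nss.cfg  (path varies per distribution)
name = NSS
# Point the JDK at the directory where the upgraded libnss3 actually
# lives; on 64-bit Ubuntu this is typically the multiarch directory.
nssLibraryDirectory = /usr/lib/x86_64-linux-gnu
nssDbMode = noDb
attributes = compatibility
```

The root of the symptom is that the JDK's SunPKCS11-NSS provider loads libnss3 from the directory named in this file, so a package upgrade that relocates the library leaves the JVM pointing at a stale path.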

After the fix, three important questions arose:

  • What broke this configuration? Nobody admitted altering the machine configuration during the ten days the build idled.
  • Why was the test environment working just fine?
  • Why did manual updates work while only the LiveRebel-orchestrated releases kept failing?

Answers started to take shape when we dug into the logs. Apparently the production machine had automatic package updates enabled, so on January 23rd it had pulled in both the fresh openjdk-7-jdk_7u51 patch and a libnss3 patch and applied the upgrade. So we had found the answer to our first question.
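If you want OS security updates but not surprise JVM upgrades, one option on Debian/Ubuntu is to blacklist the packages in question. A sketch, assuming unattended-upgrades is the mechanism that applied the patch on your machine:

```
// /etc/apt/apt.conf.d/50unattended-upgrades
// Exclude the JDK and NSS packages from automatic upgrades, so they
// can only be updated deliberately, as part of a planned release.
Unattended-Upgrade::Package-Blacklist {
    "openjdk-7-jdk";
    "libnss3";
};
```

Pinning via `apt-mark hold openjdk-7-jdk libnss3` achieves a similar effect; either way, the point is that a JVM or native-library upgrade becomes an explicit, reviewable change rather than a silent one.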

The second answer surfaced when we compared environment configurations. Our test environment was running on 32-bit JVMs, as opposed to the 64-bit production machines. Why the upgrade did not break libnss compatibility on the 32-bit JDK is another question, but we had again found a difference which, once removed, exposed the problem in the test environment as well.
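Differences like this are easy to miss. One quick sanity check is to print and diff the JVM's self-reported details on each environment – a minimal sketch (`sun.arch.data.model` is a HotSpot-specific property, hence the fallback default):

```java
// Prints JVM details worth diffing between test and production hosts.
public class JvmFingerprint {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("os.arch      = " + System.getProperty("os.arch"));
        // HotSpot-specific: reports 32 or 64; other JVMs may not set it.
        System.out.println("data model   = " + System.getProperty("sun.arch.data.model", "unknown"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}
```

Running this on both machines and diffing the output would have flagged the 32-bit/64-bit mismatch immediately.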

The answer to the third and last question became clear when we compared the uptimes of the test and production LiveRebel instances. The test LiveRebel had been restarted in the meantime and had picked up the new libnss3 configuration – which is why it kept running without configuration issues.

Having spent more than a day figuring all of this out, I can only conclude that the following has to be kept in mind when creating or maintaining your build pipeline:

  • The environments in your release stream must be identical. No excuses allowed.
  • Creating the environments must be fully automated, including the OS-level configuration. We were automating our builds only from the JVM level onwards, but the OSes had been configured manually.
  • When something is failing, it ain’t gonna fix itself. The sooner you find the root cause, the sooner you can get back to productive work – as opposed to dealing with the consequences, as I was while rolling out manual updates.

Admittedly, these are obvious conclusions. But sometimes we need to be reminded of the obvious.

Reference: How I broke our continuous deployment from our JCG partner Vladimir Šor at the Plumbr blog.


One Response to "How I broke our continuous deployment"

  1. Peter Verhas says:

    I see another point.

    Any change is a new release. This way you can see which change broke the system. Change in the application, change in the middleware, change in the JVM, change in the OS, change in the hardware.

    As far as I could see from your article, your team assumed that an upgrade to the JVM is not a change that would justify a new “release” on the server. If this is the result of a careful assessment – comparing the extra cost over the years of treating JVM patches as releases to your app environment against the cost you incur once every five years – then this is OK. If not, then you have to do this assessment, even if the outcome seems obvious and keeps the practice on its current track. No official calculations or documents – just a ten-minute to one-hour assessment meeting so that common sense comes up, and to educate yourselves. After that, a few-paragraph wiki page in your knowledge base, and that is it. A day of extra work hunting that issue is worth it.

    And thanks for the article. I like it and find it valuable. And yes: test systems and production systems should be as identical as is economically feasible, and any difference has to be assessed, not just allowed to happen. But again, that is generally true. If you are a professional, nothing should “just happen”, should it?
