How I broke our continuous deployment

This post is about a failure – more precisely, about how I managed to bring our release process to its knees. I recommend reading on, especially if you are planning to ruin your release train any time soon. Following in my footsteps is a darn good way to bring down your automated processes for weeks.

But let me start with a rough description of how we release new Plumbr versions to end users. For this we have created two environments – production and test – each on dedicated servers. In both environments, LiveRebel orchestrates the deployment of the versions built by our Continuous Integration server. The test environment is updated after each CI run; the production environment is updated daily.

On the 21st of January we stopped the nightly production updates, as one of the larger changes needed more time in QA. The intention was to pause the updates for just a few days. At the same time, the test environment was still being updated several times a day.

As you might guess, those few days turned into ten. So it was not until the 30th of January that we were able to re-enable the automatic updates in the production environment – only to discover the next morning that the update had failed during application initialization with the following error message:

java.security.ProviderException: Could not initialize NSS

As we were using LiveRebel, the update was rolled back automatically and the production site kept running the version deployed ten days earlier. So no major harm was done, especially after I discovered that a manual push to production worked just fine.

But the automated update failed again the following night. And the next. So it was already the 5th of February when I finally found time to investigate the issue more thoroughly.

Thirty minutes of googling revealed only one clue – in the form of a discussion thread about libnss3 configuration errors. Making the proposed changes to nss.cfg indeed seemed to work, and the automated releases started rolling out again.
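For context, a commonly reported variant of this fix looks like the sketch below. The exact change depends on your distribution – the file path is the Ubuntu/Debian OpenJDK 7 location, and the library directory shown is an assumption for a 64-bit multiarch machine, so verify both against your own system before applying:

```
# /etc/java-7-openjdk/security/nss.cfg  (path varies per distribution)
name = NSS
# Point the JDK at the directory where the upgraded libnss3 actually
# lives; on 64-bit Ubuntu this is typically the multiarch directory.
nssLibraryDirectory = /usr/lib/x86_64-linux-gnu
nssDbMode = noDb
attributes = compatibility
```

The root of the symptom is that the JDK's SunPKCS11-NSS provider loads libnss3 from the directory named in this file, so a package upgrade that relocates the library leaves the JVM pointing at a stale path.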

After the fix, three important questions arose:

  • What broke this configuration? Nobody admitted altering the machine configuration during the ten days the build idled.
  • Why was the test environment working just fine?
  • Why did manual updates work while only the LiveRebel-orchestrated releases kept failing?

Answers started to take shape when we dug into the logs. Apparently the production machine had automatic package updates enabled, so on January 23rd it had pulled in both the fresh openjdk-7-jdk_7u51 patch and a libnss3 patch and applied the upgrade. So we had found the answer to our first question.
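If you want OS security updates but not surprise JVM upgrades, one option on Debian/Ubuntu is to blacklist the packages in question. A sketch, assuming unattended-upgrades is the mechanism that applied the patch on your machine:

```
// /etc/apt/apt.conf.d/50unattended-upgrades
// Exclude the JDK and NSS packages from automatic upgrades, so they
// can only be updated deliberately, as part of a planned release.
Unattended-Upgrade::Package-Blacklist {
    "openjdk-7-jdk";
    "libnss3";
};
```

Pinning via `apt-mark hold openjdk-7-jdk libnss3` achieves a similar effect; either way, the point is that a JVM or native-library upgrade becomes an explicit, reviewable change rather than a silent one.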

The second answer surfaced when we compared environment configurations. Our test environment was running on 32-bit JVMs, as opposed to the 64-bit production machines. Why the upgrade did not break libnss compatibility on the 32-bit JDK is another question, but we had again found a difference which, once removed, exposed the problem in the test environment as well.
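Differences like this are easy to miss. One quick sanity check is to print and diff the JVM's self-reported details on each environment – a minimal sketch (`sun.arch.data.model` is a HotSpot-specific property, hence the fallback default):

```java
// Prints JVM details worth diffing between test and production hosts.
public class JvmFingerprint {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("os.arch      = " + System.getProperty("os.arch"));
        // HotSpot-specific: reports 32 or 64; other JVMs may not set it.
        System.out.println("data model   = " + System.getProperty("sun.arch.data.model", "unknown"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}
```

Running this on both machines and diffing the output would have flagged the 32-bit/64-bit mismatch immediately.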

The answer to the third and last question became clear when we compared the uptimes of the test and production LiveRebel instances. The test LiveRebel had been restarted in the meantime and had picked up the new libnss3 configuration – which is why it kept running without configuration issues.

Having spent more than a day figuring all of this out, I can only conclude that the following has to be kept in mind when creating or maintaining your build pipeline:

  • The environments in your release stream must be identical. No excuses allowed.
  • Creating the environments must be fully automated, including the OS-level configuration. We were automating our builds only from the JVM level onwards, but the OSes had been configured manually.
  • When something is failing, it ain’t gonna fix itself. The sooner you find the root cause, the sooner you can get back to productive work – as opposed to dealing with the consequences, as I was while rolling out manual updates.

Admittedly, these are obvious conclusions. But sometimes we need to be reminded of the obvious.

Reference: How I broke our continuous deployment from our JCG partner Vladimir Šor at the Plumbr blog.


One Response to "How I broke our continuous deployment"

  1. Peter Verhas says:

    I see another point.

    Any change is a new release. This way you can see which change broke the system. Change in the application, change in the middleware, change in the JVM, change in the OS, change in the hardware.

    As far as I could see from your article, your team assumed that an upgrade to the JVM is not a change that would justify a new “release” on the server. If this is the result of a careful assessment – comparing the extra cost over the years of treating JVM patches as releases to your app environment against the cost you incur once every five years – then this is OK. If not, then you have to do this assessment, even if the outcome seems obvious and keeps the practice on its current track. No official calculations or documents – just a ten-minute to one-hour assessment meeting so that common sense comes up, and to educate yourselves. After that, a few-paragraph wiki page in your knowledge base, and that is it. A day of extra work hunting that issue is worth it.

    And thanks for the article. I like it and find it valuable. And yes: test systems and production systems should be as identical as is economically feasible, and any difference has to be assessed, not just allowed to happen. But again, that is generally true. If you are a professional, nothing should “just happen”, should it?
