Continuous Reliability: The One Thing Your CI/CD Workflow is Missing

Tali Soroker | May 28th, 2018 | Last Updated: June 4th, 2018


Successful teams know that CI/CD isn’t enough. With things breaking faster than ever, many are adding Continuous Reliability to their workflows.

Most engineering teams have adopted an agile development practice and are pushing for shorter and faster release cycles. The difficulties associated with more frequent deployments to production, not to mention ever-growing code bases, led to the rise of Continuous Integration and Continuous Delivery/Deployment (CI/CD) tools and workflows.

CI/CD tools add a lot of automation to the build-test-deploy process, but they don’t address one of the biggest problems teams face when building and deploying new code: unpredictable errors and exceptions.

How do you gain a reliable measure of the overall quality of your code? Have you tested everything? How can you ensure that what you are about to release is “safe”?

This paper explores answers to these questions and outlines how you can adopt the concept of Continuous Reliability into your workflows.

Get a sneak peek below…

4 Causes for Failing CI/CD Workflows

Over the years, we’ve spoken with hundreds of development and operations teams and heard the same thing repeated again and again. We all need to move quickly, but this sometimes has an adverse effect on reliability and quality of the code that makes it to production. CI/CD is great, but we still need deeper test coverage. Before we can understand what’s really broken about the current CI/CD release cycle (and how to fix it), we need to understand some of the challenges that teams face after new code is deployed:

1. It’s Impossible to Test Everything

Even with the most thorough testing, staging and QA process, errors slip through the cracks. After hours of testing, uncaught and unexpected exceptions are still bound to get through to production. It is simply impossible to conceptualize and implement a 100% comprehensive set of tests for every condition/function. On top of that, CI/CD speeds up every part of the release cycle, and contributes to more errors passing through unseen into production in a shorter amount of time.

Bottom Line: The tests we write are unable to catch unexpected failure scenarios.

2. Limited Insight into the Overall Quality of an Application

The concepts of CI and CD are predicated not only on the ability to automate the promotion of code across build, test and deploy, but also on the ability to automate and measure the functional quality of our code. While no suite of tests is complete (as mentioned above), there also seems to be very limited ability to gauge the overall quality of our applications and services as they flow from dev to production. We can measure failed tests or count errors, but we lack the ability to understand the nature of these failures in aggregate across a codebase. How do we know how many critical issues we have? How many new errors have been introduced? How many issues have resurfaced? Having this level of detail could help answer the critical question of whether or not it is safe to promote.

Bottom Line: The decision whether or not to promote code to the next step in the release cycle is ambiguous at best, and random at worst.
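To make the questions above concrete, here is a minimal sketch of what an aggregate promotion check might look like. It is an illustration under stated assumptions, not a description of any particular tool: the `ErrorEvent` type, the "fingerprint" used to identify an error, the sets of previously known errors, and the `max_new` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    # Hypothetical stable ID for an error, e.g. exception type + code location.
    fingerprint: str
    # Assumption: something upstream has already flagged severity.
    critical: bool = False


def classify(current, known_open, known_resolved):
    """Group the errors seen in this build into new, resurfaced, and critical."""
    seen = {e.fingerprint for e in current}
    return {
        # Never seen before in any earlier build.
        "new": seen - known_open - known_resolved,
        # Previously marked as resolved, but showing up again.
        "resurfaced": seen & known_resolved,
        # Flagged as critical, regardless of history.
        "critical": {e.fingerprint for e in current if e.critical},
    }


def safe_to_promote(report, max_new=0):
    """Assumed policy: no critical or resurfaced errors, at most max_new new ones."""
    return (
        not report["critical"]
        and not report["resurfaced"]
        and len(report["new"]) <= max_new
    )


if __name__ == "__main__":
    events = [
        ErrorEvent("NullPointerException@CartService.checkout", critical=True),
        ErrorEvent("TimeoutError@InventoryClient.fetch"),
        ErrorEvent("ParseError@Config.load"),
    ]
    known_open = {"ParseError@Config.load"}
    known_resolved = {"TimeoutError@InventoryClient.fetch"}

    report = classify(events, known_open, known_resolved)
    print(report)
    print("Safe to promote:", safe_to_promote(report))
```

The point of the sketch is the shape of the decision rather than the specific policy: once errors are grouped as new, resurfaced, or critical, the question of whether to promote becomes an explicit rule that a team can tune, instead of a judgment call based on raw error counts.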

3. Issue Resolution Still Takes Forever and a Day

With unknown and unexpected errors getting into production at a faster rate than before, immediate identification of issues and quick troubleshooting is more crucial than ever. Unfortunately, the current practice for finding and handling errors in production is inherently flawed. Customers are the first to reveal errors in the applications and engineers end up spending an average of 20-40% of their time digging through log files and bouncing between monitoring tools trying to understand what went wrong.

Bottom Line: When things fail, we don’t always know. Even when we do know, we don’t have the full context, and have to spend a considerable amount of time on troubleshooting vs. building new features.

4. Automated Application Failure

A chain is only as strong as its weakest link, and the same holds true for the software development lifecycle. If an error slips through and breaks the build, or goes unnoticed all the way up to production, the CI/CD workflow helps to automate application failure.

With CI/CD, the quality of your testing determines the quality of your releases. Because we can’t write a fully comprehensive suite of tests, we need another way to determine the quality of our code to indicate if it’s safe to promote to the next environment. Otherwise, we risk pushing an increased number of errors to the hands of our users.

Bottom Line: Automated build-test-deploy often turns into build-test-deploy-break.

All of this raises the question: If so many teams are experiencing the same types of failures when code hits production, is there something fundamentally broken in the way we implement CI/CD and automation?