The Road to Continuous Deployment (part 3)

The story below details a very interesting & transformational project that I was a part of in 2014 and 2015, at a Dutch company. I’ve told this story before during a number of conference talks (slides and videos are available, if you’re interested), and I’ve now finally gotten around to writing it up as a series of blog posts!

This is part 3 of a multipart series on how CD, DevOps, and other principles were used to overhaul an existing software application powering multiple online job boards, during a project at a large company in The Netherlands.

In the previous installment, I discussed the Strangler pattern, branches, code reviews and feature toggles!

The Boy Scout Rule

“Leave the campsite in a better state than you found it”

A controversial rule for some, because of its qualitative (and possibly even subjective) rather than quantitative nature.

However, I do feel the rule is useful and applicable in the context of software development. The idea is that if you encounter something that’s not according to the standards you agreed on as a team, and it’s something that’s fixable in a reasonable amount of time, then you fix it right away, rather than asking for permission to do so.

To me, this is part of what constitutes good developer discipline and codebase hygiene. If fixing the issue requires more than a reasonable amount of time, by all means, involve the rest of the team and schedule some time to work on that: preferably as a pair or even in a mob-programming setting!

Quality metrics and gates

Quality is not easy to define, or rather, many different definitions of the word can be found. It’s well known, however, that quality is a precondition for speed: the only way to move fast and keep moving fast over long periods of time is by ensuring quality remains high and is not the victim of (too many) shortcuts.

During the course of development in this project, we used a number of metrics & tools to give us some insight into the quality (or lack thereof) of the code we were producing. Some of the metrics were used to implement hard gates. Hard gates (or build breakers) are created by defining thresholds for metrics: crossing a threshold fails the build. Other metrics were used to implement soft gates, which result in warnings when violated but allow the build to continue.

Code style checks and rule enforcement
Static analysis
Code coverage
Code duplication
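
To make the difference between hard and soft gates a bit more concrete, here is a minimal sketch in Python. It is not the tooling used in this project. The Gate class, the metric names and every threshold except the 100% coverage requirement for new code are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str             # hypothetical metric name, e.g. "new_code_coverage"
    threshold: float        # limit the measured value is compared against
    higher_is_better: bool  # True for coverage, False for violation counts
    hard: bool              # True: build breaker, False: warning only


def evaluate(gates, measurements):
    """Return (build_passes, warnings, failures) for the measured values."""
    warnings, failures = [], []
    for gate in gates:
        value = measurements[gate.metric]
        if gate.higher_is_better:
            ok = value >= gate.threshold
        else:
            ok = value <= gate.threshold
        if ok:
            continue
        message = f"{gate.metric}: {value} violates threshold {gate.threshold}"
        # Hard gates break the build, soft gates only produce a warning.
        (failures if gate.hard else warnings).append(message)
    return (not failures, warnings, failures)


# Illustrative configuration; the duplication and style thresholds are assumed.
GATES = [
    Gate("new_code_coverage", 100.0, higher_is_better=True, hard=True),
    Gate("style_violations", 0, higher_is_better=False, hard=True),
    Gate("duplication_percent", 5.0, higher_is_better=False, hard=False),
]
```

Calling evaluate(GATES, ...) with a duplication percentage above the assumed limit yields a warning but a passing build, while a coverage value below 100 fails the build.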

On code coverage

Code coverage is another controversial topic. Code coverage on its own doesn’t really say anything about quality, or specifically the quality of the tests that are producing the coverage. But it’s one of the metrics that we used as an indicator, a signal of where things were going and where we needed to focus our efforts.

One of the hard gates in the application, and one of the more controversial ones, was enforcing 100% code coverage for all new code written. Now, most of the code in this application was written in PHP, which is an interpreted language that has no compiler acting as part of the safety net.

In compiled languages, removing a method that’s still called by another piece of code will result in a compilation error, and a failed build. In PHP, this is not the case. Having 100% code coverage helps in defeating that problem. Of course, having higher level tests (such as integration, acceptance or end-to-end tests) break on a deleted method will also indicate the issue. However, higher level tests are typically slower, more difficult to maintain, and their coverage is typically accidental, rather than intentional.
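
The effect is easy to reproduce. The sketch below uses Python rather than PHP, purely as an illustration: like PHP, it resolves method calls at runtime, so a call to a deleted method goes unnoticed until that line is executed. All class and function names are hypothetical.

```python
class JobRepository:
    def find_by_id(self, job_id):
        return {"id": job_id, "title": "Developer"}

    # Imagine a find_by_slug() method used to live here and was deleted
    # during a refactoring.


def job_title(repo, job_id):
    return repo.find_by_id(job_id)["title"]


def job_title_by_slug(repo, slug):
    # Still calls the deleted method. Loading this module succeeds;
    # the error only surfaces when this line actually runs.
    return repo.find_by_slug(slug)["title"]


def unit_test_job_title():
    assert job_title(JobRepository(), 1) == "Developer"


def unit_test_job_title_by_slug():
    # A unit test that covers the stale call fails right away
    # with an AttributeError, instead of in production.
    assert job_title_by_slug(JobRepository(), "developer") == "Developer"
```

Without a test that executes job_title_by_slug(), the build stays green and the problem only shows up when a user hits that code path.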

Another argument in favor of enforcing 100% code coverage is that every dip below the line (and the associated build failure) triggers a discussion about whether to add a “skip” for a particular piece of code. That discussion requires an explicit decision. This typically doesn’t occur for lower thresholds (where 80% seems to be a common number).

Continuous testing

Testing is one of the most important parts of any Continuous Delivery story. During my “Road to Continuous Deployment” talks I introduced testing from the perspective of “defense in depth”.

Like the cyber security concept and the military strategy with the same name, the intent is to have multiple layers of testing that all contribute to a high degree of confidence that a change to a system won’t break that system’s functionality (in production).

Unit tests form the first, inner-most layer. Unit tests are small, cheap and fast. The next layer is comprised of integration tests, where we test components, systems and the integration with (external) datasources.

After that come acceptance tests, which in this project were implemented through a methodology called Behavior Driven Development. Scenarios and examples were written up in plain English, describing the feature or functionality that was under test. These scenarios were written by business analysts, testers and product owners.

Last but not least: UI (end to end) tests. Typically the slowest of all the layers, and easily the most brittle and expensive to maintain. They have their value though, in certain cases.

An alternative to the “defense in depth” strategy is the Test Pyramid model. NB: Some people prefer talking about Testing Quadrants rather than the Test Pyramid (see https://johnfergusonsmart.com/test-pyramid-heresy/).

At the bottom of the pyramid we find the cheapest and fastest tests, unit tests. Because they are cheap and fast, we have a lot of them. As we go up in the pyramid, tests become more expensive, slower and more difficult to maintain. Thus, going up, each successive level in the pyramid will have fewer tests than the level below it.

Monitoring (and alerting) is an oft overlooked part of testing. If your application starts to slow down, or starts throwing errors an hour after deployment, you want to be able to catch that. You may not be able to catch that in a test, but your monitoring/alerting tooling should be able to!
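
As a rough illustration of that idea, the sketch below tracks the error rate over the most recent responses and signals when it crosses a limit. It is a simplified stand-in for real monitoring tooling, not what this project ran. The class name, the window size and the 5% limit are assumptions.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of server errors over the last `window_size` responses."""

    def __init__(self, window_size=100, max_error_rate=0.05):
        # The deque silently drops the oldest entry once the window is full.
        self.results = deque(maxlen=window_size)
        self.max_error_rate = max_error_rate

    def record(self, status_code):
        # Treat every HTTP 5xx response as an error.
        self.results.append(status_code >= 500)

    def error_rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_alert(self):
        return self.error_rate() > self.max_error_rate
```

In practice the status codes would come from access logs or a metrics system, and should_alert() would feed whatever alerting channel the team uses.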

Deployments: rolling updates

For our deployments we investigated a number of strategies and ended up settling on rolling updates. Essentially, this strategy boils down to iteratively replacing existing production instances with ones containing new / changed code.

Specifically, this meant the pipeline for a service would:

Retrieve new code changes;
Build a new image (incorporating those code changes);
Push that image to our central image repository;
Start a container based on that new image;
Verify the container has started, is listening on the correct port and responds correctly to smoke tests / health checks;
Add the new container to the load balancer and verify it’s receiving & processing traffic;
Gracefully remove one of the existing containers from the load balancer (allowing it to finish processing in-flight requests);
Stop the old container;
Repeat until all containers have been replaced.

If a container fails to start or pass its health checks, the pipeline fails and we can find and fix the problem. If that failure happens on starting the first new container, we would still have a full set of existing production containers serving traffic. For our use case, this way of doing rolling updates provided us with sufficient confidence.
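
The replacement loop at the heart of these steps can be sketched as follows. This is a simplified simulation in Python, not the actual pipeline code: the load balancer is modeled as a plain list, the build and push steps and the draining of in-flight requests are left out, and start_container, is_healthy and stop_container are hypothetical callbacks standing in for the real container tooling.

```python
class HealthCheckFailed(Exception):
    pass


def rolling_update(pool, new_image, start_container, is_healthy, stop_container):
    """Replace every container in `pool` with one running `new_image`, one at a time.

    `pool` is a list that stands in for the load balancer's backend pool.
    """
    for old in list(pool):  # iterate over a copy, since the pool changes
        new = start_container(new_image)
        if not is_healthy(new):
            # The new container was never added to the pool, so the
            # existing containers keep serving traffic. Cleaning up the
            # failed container is left out of this sketch.
            raise HealthCheckFailed(f"{new} failed its health check")
        pool.append(new)      # add the new container to the load balancer first
        pool.remove(old)      # then take an old one out (draining omitted)
        stop_container(old)   # and stop it
    return pool
```

Adding the new container before removing an old one means the pool never shrinks below its original size during the update.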

There are other deployment schemes / strategies of course. Some that you may want to investigate are canary and blue/green deployments.

Pipelines

A pipeline is – usually – a sequence of steps or stages. As code moves (flows) through the pipeline, those steps are invoked to perform a task. Tasks like building, packaging, testing, configuration management, orchestration, etc.

These tasks are repetitive, dull and mundane: they can (and should!) be automated. This leaves more time for tasks that do require human thought and involvement. Automation is central to a pipeline.

You can read more about CI/CD pipelines here: https://www.michielrook.nl/2018/01/typical-ci-cd-pipeline-explained/

Feedback on failing builds

A key thing in CD is feedback, the faster the better. In the case of a failing build, or a red build, we want to know about it and want to know about it quickly and effectively. In the case of this application, we chose to go for an LED siren lamp and speaker combination. A small script continuously polled Jenkins for build results. Whenever we had a failing build, the lamp started flashing and the speaker started blasting an audio clip.
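
A polling script along those lines could look roughly like the sketch below. It is a reconstruction in Python, not the original script. It assumes the job’s last completed build can be read through Jenkins’ JSON remote API without authentication, and set_siren is a hypothetical callback that switches the lamp and speaker on or off.

```python
import json
import time
import urllib.request


def fetch_last_result(job_url):
    """Ask Jenkins' JSON remote API for the result of the last completed build."""
    url = f"{job_url}/lastCompletedBuild/api/json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["result"]  # e.g. "SUCCESS" or "FAILURE"


def poll(job_url, set_siren, fetch=fetch_last_result, interval=30, max_polls=None):
    """Poll the job and switch the siren on while the last build has failed.

    `max_polls` limits the number of iterations (None means poll forever);
    `fetch` can be swapped out, which keeps the loop testable without Jenkins.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        set_siren(fetch(job_url) == "FAILURE")
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

A call such as poll("https://jenkins.example.com/job/some-job", set_siren) would then run indefinitely; the URL here is a placeholder.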

This setup was called the “siren of shame” by some. Now to be clear, we had no intention of shaming anyone. However, the idea behind this form of “extreme feedback” is to alert the team to a failed build and trigger them into solving that.

Indeed, in the context of CD, the pipeline is the only process that ultimately interacts with production servers and deploys our software to those same servers. The days of manually deploying things onto production servers are hopefully behind us! Thus, a broken pipeline becomes a high-priority issue to look into and fix.

Unfortunately, pipelines can break for many reasons, such as unstable network connections or failing package repositories, and not all of them are easily fixed. These issues can be extremely frustrating, but they still require attention.

Applying DevOps principles

Even though DevOps as a phrase has been hijacked by both recruiters and tool vendors, I’m still a firm believer and practitioner when it comes to the principles behind DevOps. I have embraced DevOps since way before the term was even popular, or at least before I was aware of it!

I consider DevOps to be primarily a cultural phenomenon. At the core it’s about developers and operational people working together where in the past they were separated. Working together and sharing pain. Because shared pain equals shared understanding equals shared commitment, which ultimately results in a better system.

As Werner Vogels once said: “You build it, you run it”. However, building is easy, running is hard. That requires a team that’s focused, committed and takes responsibility for the applications or services it owns.

Stay tuned for part 4!

Michiel Rook's blog