Slow CI: real problem or easy excuse for developers?

When you write software, you can end up in a place where your continuous integration takes too long to execute. For most of the post I will write CI because it’s faster to type. It also makes your read slightly faster.

When delivering software continuously, the more complex the solution, the more prone you are to ending up with a slow CI process at some point. Note that limited computing power also increases the chances of slow CI becoming your burden.

Let’s start with something we can all agree on. Frustration. The frustration of finishing a bit of work and pushing it, only to wait a long time for that green tick on the CI run. By long I mean beyond thirty minutes. In my experience, shortcuts are taken when setting up continuous integration. Those shortcuts are not the spawn of evil minds deciding to sabotage a project. Often you will be responsible for them. Most likely the “Getting started” CI setup you googled did exactly what you needed. No need to look further, it does the job. Documentation? You ain’t gonna need it, right?

Most of the time, it will actually do. But sometimes, it won’t. Sometimes, these decisions come back to haunt you as you extend your build validation process. Unfortunately, there are times when you do everything by the book and still have a slow CI. Is it actually your main problem? Can you work around having that slow CI? Do you need to make it fast no matter what? That is today’s topic so let’s just jump into it.

What can make your continuous integration slow?

Expensive test setups

Unit tests are usually fine: they concern a tiny bit of code executed in isolation, which should be fast enough that you don’t notice whether you have one or hundreds. When you start moving into component and integration tests, this is where the pain arises. I am going to list issues I encountered and ways they were or could have been solved.

Repeated expensive run setup

I have seen, more often than I would like, whole environments being set up and torn down as Docker images. For. Each. Freaking. Test. The purpose there was to have some pristine databases to validate data migration scenarios. One way to speed things up in that scenario is to execute your setup once for your whole CI run. From there you can have database clean-up scripts executed between each test. It might take a few more lines to set up originally but it will be worth it in the long term.

If you want to kick it up a notch you may even create your own Docker image with your environment already set up. From there, make your CI system pull it and voilà. On a project I worked on recently, the environment setup took up to four minutes only to load Docker images and set up databases inside. Having an image ready with the right configuration would save up to four minutes on each CI run. Feel free to play with the concept for some time before moving it anywhere near any production setup you have.

I realise I have been mentioning Docker and databases for integration tests in this part. However, having your expensive setup executed once before all your tests should always be your priority whenever possible.

Imbalance in your testing

Another thing that may make your build slow is the wrong balance in your tests. My personal opinion is that one should handle most scenarios in a system at unit test level. Component, system and integration tests should be covering happy path scenarios and basically making sure all the pipes connect correctly and do what they should.

Too often I see people fixing bugs using assurance at the wrong level, which results in a bunch of new tests more expensive than needed. I should probably write something to go deeper on that topic at some point. Your software may be too coupled for that and maybe your only option is to test the whole thing at once. One way of doing it is by leveraging the power of Postman collections coupled with Newman. This is fine, it is always better to have slow tests rather than no tests. However, you should use your slow tests as crutches to refactor and make room for cheaper, faster tests. You will get to reap the benefits of that approach soon enough.
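To make the balance a bit more concrete, here is a small Python sketch. Everything in it is hypothetical: parse_quantity stands for a business rule and handle_order for the layer that wires things together. The edge cases live at the cheap unit level, while the higher level only checks that the pipes connect on the happy path.

```python
def parse_quantity(raw: str) -> int:
    """Hypothetical business rule: a quantity is a positive whole number."""
    value = int(raw.strip())  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("quantity must be positive")
    return value


def handle_order(payload: dict) -> dict:
    """Hypothetical wiring layer: turns a request payload into a response."""
    try:
        quantity = parse_quantity(payload["quantity"])
    except (KeyError, ValueError):
        return {"status": "rejected"}
    return {"status": "accepted", "quantity": quantity}


def rejects(raw: str) -> bool:
    """Helper for the checks below: True when parse_quantity refuses the input."""
    try:
        parse_quantity(raw)
    except ValueError:
        return True
    return False


# Unit level: cheap, so this is where the awkward scenarios go.
assert parse_quantity(" 3 ") == 3
assert all(rejects(bad) for bad in ["0", "-1", "abc", ""])

# Higher level: a single happy path proving the pieces are connected.
assert handle_order({"quantity": "2"}) == {"status": "accepted", "quantity": 2}
```

In a real system the last check would be the slow one, going through HTTP and a database, which is why I would keep it to one scenario and not five.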

Other reasons?

So far, the two previous sections covered what I experienced directly but I might update this post as I find more scenarios making builds slow. But now let’s jump into what actually grinds my gears.

Justifying slow CI as your main problem

We could deliver faster

Darn right you could! A reason I often hear for speeding up slow CI runs as a priority is that they hurt the development and delivery cycles. Delivery could be faster and stuff. To that I’ll say: Plaît-il? Of course it might play a role but in my experience it has never been a factor big enough to justify deliveries not happening on time.

A Netflix engineer commits code and sees it in production within days. This is the kind of cycle time I can only dream of in a commercial context. Personal projects aside, the only times I saw code deployed (or even merged!!!) within a few hours or days of being committed were to fix production issues.

As far as I have observed, the time spent on code reviews, meetings and demos prior to a release is far more significant. It reaches the point where the cumulative time of slow CI runs for the code that gets released is negligible in comparison. You want to release faster? Fine. Seek to optimise these processes before even considering touching that “monstrously” slow CI. Moving on.
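If you want to check this for your own project, the arithmetic is easy to sketch. The Python below is only an illustration: the ChangeTimeline class and every number in it are made up, so plug in your own measurements before drawing any conclusion.

```python
from dataclasses import dataclass


@dataclass
class ChangeTimeline:
    """Hypothetical hours one change spends in each stage before release."""

    ci_runs: int
    ci_hours_per_run: float
    review_wait_hours: float
    meetings_and_demo_hours: float
    release_wait_hours: float

    def ci_hours(self) -> float:
        # Cumulative CI time across every run triggered for this change.
        return self.ci_runs * self.ci_hours_per_run

    def total_hours(self) -> float:
        return (
            self.ci_hours()
            + self.review_wait_hours
            + self.meetings_and_demo_hours
            + self.release_wait_hours
        )

    def ci_share(self) -> float:
        # Fraction of the whole lead time that is spent waiting on CI.
        return self.ci_hours() / self.total_hours()


# Made-up example: four hour-long CI runs, two working days waiting for
# review, half a day of meetings and demos, three days until the release.
timeline = ChangeTimeline(
    ci_runs=4,
    ci_hours_per_run=1.0,
    review_wait_hours=16.0,
    meetings_and_demo_hours=4.0,
    release_wait_hours=72.0,
)
print(f"CI share of lead time: {timeline.ci_share():.1%}")
```

With numbers like these, even halving the CI runtime would only remove two hours out of ninety-six, which is the kind of comparison worth making before prioritising CI work.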

I work on multiple branches for a feature I’m delivering

First of all, don’t. Just don’t do that. Git has a lot of really cool features that allow you to do most of what you need in a single branch. You get to revert some changes, cherry-pick some in a branch somewhere. You can even unify all your work within a single commit once you’re done for a cleaner history.

However, if you do choose to work on multiple branches you intend to merge into one that you will get reviewed, don’t push them all to make sure they pass CI. Also don’t open pull requests to yourself to merge them. Unless proven otherwise these branches don’t interest anyone but yourself. Nobody cares that you reviewed yourself or got 20 branches with a green build when you are going to put them into one to review. All I will care about as a reviewer is that one branch you will submit for review. Simply merge them into your final branch, fix any conflicts and issues you find, then push it. There you go, one CI run, maybe a long one but still one.

I realised I needed to do a change after pushing code

This is not an invalid point, but as far as I am concerned, knowing how long your CI takes to run, you may want to double check yourself before pushing stuff. Also, if you notice that you need changes halfway through an hour-long CI run, the process being long has no impact. You would have thought about these changes at the same time in the same manner. Maybe the little feeling of panic that made you realise you needed to add some changes is thanks to the slow CI. Be grateful, you could have been in the 70s, when people had to take flights to fix bugs in software stored on punched cards. Let’s keep going.

We didn’t factor the CI runtime in our estimates

Let’s say you’re using the Scrum toolkit to develop your software increment. That increment needs to be “Done” by the end of the sprint in which you start working on it. Before the sprint begins you have a planning session taking into consideration what can and cannot be done before making a piece of work part of the next sprint. When that planning occurs, unless it is the very first sprint, you have an idea of how long your CI takes to run and should factor it into your estimates.

One thing I have been experiencing is having a Sprint 0 to spike, groom the first project tickets and set up CI. If you do that, when you start estimating work for a sprint, you know how long your CI runs take. Then if it becomes a problem for the Product Owner you can still create a ticket to try and make your CI run faster. Too often this is not taken into consideration when defining estimates. The same goes for demos, code reviews, etc. Time to jump into the danger zone.

Dangerous thoughts

Dropping tests a.k.a. Seppuku without the honour

At times, there is simply nothing you can do to make your CI any faster. Maybe you already did every single thing you could do to improve the runtime. Yet you still have to wait potentially hours to have a build pass. You may be tempted to summon a meeting to complain about how slow the build is. Heck, you may even get a task force together to figure out any way to make things faster.

However, if everything that reasonably could be done has been done already, whatever comes out of complaining can become dangerous. You may want to stop running some of the tests that validate the behaviour of your software in production. Maybe you think it will be fine, that everyone can keep running, on their local machine, all the tests they think their change affects. If it was that easy we developers would not need continuous integration to begin with. If you feel pressured by tests running themselves in the cloud, why would that be any different when you have to run them on your machine? You know deep down you will skip the hell out of them, especially if you’re getting closer to the end of your sprint.

From that point onward, only trouble can emerge. New bugs get introduced without anyone noticing, in code we believed was covered by CI when it was originally written. Nobody will double check because every member of your team relies on CI. They will trust you to run the missing tests on your local machine but most likely you won’t run them. Especially not for every change you make, even though Martin Fowler’s Refactoring book (to which Kent Beck contributed) recommends it. It is possible to get these run automatically in the background on your machine as you write code but it quickly becomes expensive. Then writing code becomes slow and everyone loses. Unless you have duplicated tests or they overlap 100%, don’t even think about dropping them.

Altering the software configuration for CI runs

I heard that suggested once to tackle slow continuous integration. Absolutely. F$&k^£@. Not. More seriously, that’s the best way to end up dealing with bugs at CI level that would not happen with the production config. Spoiler alert, it gets better. You can introduce production bugs that will not be picked up through CI due to that configuration difference.

I remember that one time on a project looking at a deployed bit of software. It was freshly pushed to production after going through reviews, automated testing and deployment. It didn’t work. Why? I hear you think. Well, after asking a colleague I found out that you had to comment out some lines for it to work. After it went through our release process. In f8^%&*! production! Luckily that system ran offline but still I was flabbergasted. How do we even know that it does what our whole validation process says it does? Obviously, bugs ensued but that’s a post for another day.

So should you ignore slow CI runs?

No, you should not. Personally I would try to see where I can have an impact on delivery speed. Every single time I looked at a project I worked on, slow CI ranked very low on the list. However, code reviews very often ended up at the top of the list, closely followed by lengthy release processes. The lengthy release process is not an issue in itself, you’re free to release code yearly if you want to. But you need to target what actually slows you down at a broader scale than your work day.

This is where you need to ask yourself a few questions. How good is it to have CI run in seconds when the result will take hours to days before a first review comment? How good is it to have a short CI runtime when the software will not be delivered for weeks, months, years?

I don’t see these questions asked often enough, but optimising stuff for the sake of it is a trap too many of us fall into. I won’t hide that I fell into that trap more often than I would like and I know it will happen again. I’ll just keep learning all the shapes and forms it takes and grow from it.

Once again, thanks for reading.