In my previous post I explained why I think you should use Jenkins (or its twin Hudson), Nexus, and Sonar to super-charge your Maven builds. To summarize, Jenkins is a continuous integration server that runs your builds, Nexus is an artifact repository that versions and stores your jars/wars/zips/etc., and Sonar is a metrics server that gathers code metrics and produces nice reports to help you improve code quality. All 3 products are free OSS and really useful. But scaling anything is hard. In this post I’ll talk about some of the challenges that you might face when you scale up a Jenkins infrastructure from a few builds a day to thousands of builds a day, and some tips to help overcome those challenges. In the following post, I’ll cover Nexus and Sonar tips.
The Demo Went Well, But Now The Honeymoon Is Over
When you first introduce any new piece of infrastructure, your biggest challenge is generally just getting it installed and working at all. That is surprisingly easy with the Jenkins/Nexus/Sonar stack. They install easily. They have good-looking, intuitive UIs and demo really well. They play nicely together. You figure, “Setting up this CI (continuous integration) thing is a slam dunk. I’ll be done by lunch.” And then you introduce your beautiful new system to the users. Uhhgg. The users. Everything worked fine until they showed up and started breaking it. In this case, the users are the various development teams within your organization that want to #1 build their code with Jenkins, #2 store their artifacts in Nexus, and #3 gather code metrics with Sonar.
Jenkins runs all of your builds, so obviously it requires a lot of CPU for compilation, running tests, and static code analysis. You will quickly need multiple boxes to handle the daily load of CI builds. Fortunately, Jenkins makes it quite easy to build out a server farm: a “master” instance distributes builds to many “slave” instances. In practice this works very well. Some tips:
Tip 1: Partition your master/slave clusters by something logical like development organization. Don’t put all of the builds on a single cluster unless you work for a very small organization. This is important for 3 reasons:
- It isolates development organizations from each other so you can, for instance, restart or upgrade one cluster without affecting the others, and it keeps “problematic” development organizations isolated (you know who you are…the kind of developer who puts infinite loops in your unit tests and stuff like that).
- It allows you to configure security differently for each cluster, which you may need to do if the groups within your company don’t like each other. Hey, you’re a DevOps engineer, not a psychiatrist, so just go with it.
- The Jenkins UI will get very messy very quickly with hundreds of jobs to wade through. It does have filtering features, but it is still slow to render in browsers. IE, I’m looking at you.
Tip 2: Early on, come up with a strategy to automate the creation of new master and slave instances. Two good options are using a provisioning tool like Puppet or Chef, or cloning a VM. One bad option is setting things up manually from memory.
This kind of automation is important because if you are scaling up (adding more and more development teams to your infrastructure), you’ll most likely end up: #1 adding more master/slave clusters, and #2 making global changes across your master/slave instances. For example, Jenkins has an awesome plugin community, so it is likely you’ll be finding new and useful plugins often. Say you have 5 Jenkins clusters partitioned by development organization (a good choice for partitioning): you’ll have to install each new plugin on all 5 master instances manually. And say you need to change something in the environment, and assume your 5 master instances each have 2 slaves: now you’ve got to make a change in 15 places. Your chances of fat-fingering a change go way up, plus who wants to do all of that typing? So get on board with the DevOps movement and automate, so your infrastructure becomes code.
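To make that concrete, here is a minimal sketch of the "push a change to every master" case in Groovy. The master URLs and plugin name are made up, and I'm assuming the Jenkins CLI jar is sitting locally and that whatever authentication your setup requires (SSH keys, username/token) is already configured:

```groovy
// Hypothetical master URLs; substitute your own clusters.
def masters = [
    'http://jenkins-team-a.example.com:8080',
    'http://jenkins-team-b.example.com:8080',
]
def plugin = 'git'  // short name of the plugin to install everywhere

masters.each { url ->
    // Every master serves its own CLI jar at ${url}/jnlpJars/jenkins-cli.jar.
    // Authentication options are omitted here for brevity.
    def cmd = ['java', '-jar', 'jenkins-cli.jar', '-s', url,
               'install-plugin', plugin, '-restart']
    println "Installing '${plugin}' on ${url}"
    def proc = cmd.execute()
    println proc.text   // prints the CLI's response once the install finishes
}
```

Point the same kind of loop at your slaves for environment changes, and the "15 places" problem collapses into one script run.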
Problem 1:
You configure Jenkins’ security permissions to allow the development teams to create and manage their own jobs. This is a problem because, if you want any uniformity at all in your builds, 500 developers all changing their jobs willy-nilly creates a mess in Jenkins. It makes any kind of scripted global change to Jenkins jobs very difficult, and it doesn’t allow you to use Jenkins to enforce standards or provide a software “chain-of-custody” from source repository to production, which can be a very big deal in a big company. Just having standards for Jenkins job names is actually very useful, and in a “free-for-all” model no standards can be enforced.
Problem 2:
The reverse. You configure Jenkins’ security permissions to prevent the development teams from creating and managing their own jobs; only a select group of Jenkins admins can do that task. The Jenkins admins now have a new full-time job, and it is a very un-fun one: manually creating jobs all day.
Solution:
The crux of the issue is that you want developers to be able to change some fields in a job, like the source code URL of their project, but not others, like mandatory build steps such as quality gates or auditing steps. You also probably want to prevent developers from using Jenkins’ cool-but-dangerous feature of letting a job run arbitrary script code on the server, which obviously could do all kinds of mischief. Jenkins security permissions only let you grant the ability to create and manage a job in its entirety or not at all. What you really need is to set permissions on a field-by-field basis.
I don’t have the perfect solution for this problem. For some of you out there, the whole “Jenkins job management” problem isn’t a problem at all: just let the developers own their jobs and be done with it. I was on that side of the argument for a while, but experience has beaten me down to the realization that some controls are actually a good thing.
There are 2 solutions I can think of. One is to create a Jenkins plugin that creates a new job type that is customized to your needs. I don’t really like that one.
My suggested solution is to create a simple “Jenkins job management” web application in your favorite rapid application framework (Rails, Grails, etc.) that developers use to create jobs. This application only lets them set the fields that are “safe” and, behind the scenes, does the job creation and maintenance via Jenkins’ easy-to-use REST API. This is the best of both worlds: self-service creation of jobs, but with a measure of control.
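Under the hood, creating a job boils down to POSTing a config.xml to Jenkins’ createItem endpoint. Here is a rough Groovy sketch of what the management app’s back end could do; the URL, credentials, job name, and template file are placeholders, and newer Jenkins versions may also require a CSRF crumb header:

```groovy
// Placeholder values; a real app would take these from its own UI/database.
def jenkinsUrl = 'http://jenkins-team-a.example.com:8080'
def jobName    = 'orderteam-orderservice-ci'          // enforces your naming standard
def auth       = 'jobadmin:apitoken'.bytes.encodeBase64().toString()

// The template is owned by the management app; developers only supply the
// "safe" fields (here, just the SCM URL) that get substituted in.
def configXml = new File('job-template.xml').text
        .replace('@SCM_URL@', 'https://scm.example.com/orderservice.git')

def conn = new URL("${jenkinsUrl}/createItem?name=${URLEncoder.encode(jobName, 'UTF-8')}")
        .openConnection()
conn.requestMethod = 'POST'
conn.doOutput      = true
conn.setRequestProperty('Authorization', "Basic ${auth}")
conn.setRequestProperty('Content-Type', 'application/xml')
conn.outputStream.withWriter('UTF-8') { it << configXml }
println "Jenkins responded with HTTP ${conn.responseCode}"
```

Because the template lives with your management app, every job gets your mandatory build steps and naming standard for free, and developers only ever touch the “safe” fields.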
Problem 3:
The builds run really slow.
Solution:
There are many, many reasons why this would be true, but there are 3 things I’ve found helpful aside from just buying bigger hardware.
The 1st thing is to profile your build. I wrote a simple AspectJ aspect (using load-time weaving) to profile Maven builds and report timings for each Maven plugin that ran. That helped break a 45-minute build down into its individual steps and explain why it was taking so long.
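My aspect isn’t published anywhere, but a minimal sketch of the same idea looks something like this: an annotation-style AspectJ around advice on Mojo.execute(), woven at load time by putting the AspectJ weaver agent on MAVEN_OPTS and registering the aspect in an aop.xml (details vary with your AspectJ and Maven versions):

```groovy
import org.aspectj.lang.ProceedingJoinPoint
import org.aspectj.lang.annotation.Around
import org.aspectj.lang.annotation.Aspect

// Times every Maven plugin goal. All Maven plugins implement Mojo.execute(),
// so one around advice gives a per-plugin breakdown of the build.
@Aspect
class MojoTimingAspect {

    @Around('execution(* org.apache.maven.plugin.Mojo+.execute(..))')
    Object timeMojo(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis()
        try {
            return pjp.proceed()
        } finally {
            long elapsed = System.currentTimeMillis() - start
            println "[build-profile] ${pjp.target.class.name} took ${elapsed} ms"
        }
    }
}
```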
The 2nd thing is to take all of the build steps that can be deferred until later and process them asynchronously. A CI job needs to run compilation and it needs to run tests to provide immediate feedback. You can’t defer those steps. But there are many others that you potentially can defer. For example, Maven site generation is slow. Running Sonar metrics can also be slow. So instead of running the Maven site and Sonar stuff during your build, run them asynchronously. This takes a little engineering but you are a software engineer, right? You could write a simple Jenkins plugin that puts a message in a queue after each successful build. Then have a process outside of Jenkins — potentially on a different server — read the queue, and run things like Maven site and Sonar. You can potentially make your builds much faster using this technique, and I’ve used it successfully.
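As a rough sketch of the Jenkins side (this is not a published plugin, and for simplicity I’m standing in a spool directory for the message queue; JMS or any other broker slots into the same place), the listener could look something like this:

```groovy
import hudson.Extension
import hudson.model.Result
import hudson.model.Run
import hudson.model.TaskListener
import hudson.model.listeners.RunListener

// After every successful build, drop a small message file that an external
// worker (outside Jenkins) picks up to run Maven site / Sonar asynchronously.
@Extension
class DeferredWorkNotifier extends RunListener<Run> {

    // Placeholder path; a real plugin would make this (or a queue URL) configurable.
    static final File SPOOL_DIR = new File('/var/spool/ci-deferred-work')

    @Override
    void onCompleted(Run run, TaskListener listener) {
        if (run.result != Result.SUCCESS) {
            return  // only defer extra work for builds worth analyzing
        }
        SPOOL_DIR.mkdirs()
        // One message per build; the worker reads it, does the slow steps, deletes it.
        new File(SPOOL_DIR, "${run.parent.name}-${run.number}.msg").text =
                "job=${run.parent.name}\nbuild=${run.number}\n"
    }
}
```

The external worker then just polls the directory (or queue), runs the Maven site and Sonar analysis for each message, and moves on.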
The 3rd thing is to pay attention to the build time trend information that Jenkins provides, and automatically email developers if their build takes X% longer than it used to. I’ve often seen the root cause of a suddenly slower build turn out to be a slow unit test that was just introduced. Fixing the test improves the build times, and it is much easier to find the offending test if you notice the slowdown right away. You can get the build times via Jenkins’ REST API, so you can write a little script (Groovy anyone?) that is scheduled to run every night and checks the build times, providing rapid feedback if a job suddenly gets much slower.
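A minimal sketch of that nightly script, again in Groovy: the Jenkins URL, job list, and 30% threshold are placeholders, and depending on your security setup you may need to add authentication to the request:

```groovy
import groovy.json.JsonSlurper

// Placeholders: point at your own masters/jobs and tune the threshold.
def jenkinsUrl = 'http://jenkins-team-a.example.com:8080'
def jobs       = ['orderteam-orderservice-ci']
def threshold  = 1.30   // flag builds 30% slower than the recent average

jobs.each { job ->
    // Ask Jenkins for the last 10 builds' numbers, durations (ms), and results.
    def api  = "${jenkinsUrl}/job/${job}/api/json?tree=builds[number,duration,result]{0,10}"
    def data = new JsonSlurper().parse(new URL(api))

    def done = data.builds.findAll { it.result == 'SUCCESS' && it.duration > 0 }
    if (done.size() < 4) return                 // not enough history to compare

    def latest   = done.first()                 // Jenkins returns newest builds first
    def previous = done.drop(1)
    def avg      = previous.sum { it.duration } / previous.size()

    if (latest.duration > avg * threshold) {
        // Swap this println for an email or chat notification to the team.
        println "${job} #${latest.number} took ${(latest.duration / 1000) as int}s, " +
                "recent average is ${(avg / 1000) as int}s"
    }
}
```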
Problem 4:
No one pays attention to failing Jenkins builds.
Solution:
This is a real problem. Developers face a barrage of emails daily, and sometimes the job failure email is just one more to ignore. Obviously peer pressure is the main way to make people care about failing CI jobs. There are many Jenkins plugins that provide build notifications, but my personal favorite way to get people to care is to bridge the virtual world into the physical world by building a CI orb. A CI orb has both a visual representation (some kind of light: a traffic light, lava lamps, a glowing orb) and an audio output (“Joel has broken the build”). Instructions for a cool one that I’ve personally seen can be found here. The CI orb is not just a gimmick; it really does work. It is easy to ignore 1 or 2 failing Jenkins jobs out of 50. But it is much harder to ignore a large pulsating red orb next to your boss’s cube, or your name being read out on a loudspeaker. A CI orb helps your team realize the purpose of CI, which is to respond to failures quickly.
In the next post, I’ll give you some tips for Nexus and Sonar.