I just saw the announcement for the Biggest Update for Jenkins in a Decade. Horizontal scalability was one of the stars of that announcement, enabled by a new feature called HA Mode in CloudBees CI. This feature promises to make your monolith more performant and stable! Huzzah! What it doesn't promise is to fix your monolith.
OK - hear me out - monoliths happen, and if there is a way to keep them stable and performant during peak times, I’m all for it. Horizontal scalability in this use case can help alleviate some of those pains by spreading the workload across replicas and removing the single point of failure, thereby making the controller performant and stable. Fantastic!! But it does this by making a replica of your monolith, meaning the monolith still exists. The controller overloaded with 10,000 jobs now has a replica friend, and together, they work hard at keeping those services up.
I’m sure you have seen the articles about breaking up the monolith, and you know that turning a monolithic Jenkins controller into smaller, faster, and more team-focused controllers has many benefits. Implementing this kind of change takes time. A LOT of planning and time. Breaking up a monolith involves best practices, correct sizing, and a change in Jenkins admin culture. And culture itself takes time to change; just look at DevOps.
So, for this article, let's concentrate on those first steps - the baby steps - of getting your monolithic controller to be as performant as it can be. Let's assess your monolith and do a data clean-up. While we do this, we can rely on horizontal scalability to alleviate the pain of an overloaded monolithic controller by keeping services up and running, even during peak usage or routine infrastructure maintenance. From there, you can plan your move, on your timeline, to a more robust architecture - whether that means a microservice architecture running on Kubernetes or staying on-premises and splitting your monolith into team-based controllers. But first…
Let’s go back to the beginning
How did Jenkins/CloudBees CI admins (users) get here? Where did this monolith come from, and why is it still here? There are many answers to these questions, and the ones I’ve heard the most are shared below. But the very first question is - what is a monolith?
What is a monolith?
The term "Jenkins Monolith" refers to the practice of teams utilizing a single Jenkins controller with a large number of jobs and build history. It usually also has numerous plugins and integrations, supporting multiple developers' code.
How did you get here?
It could be your one and only Jenkins controller that every developer uses.
It happens organically - you may have been using Jenkins for years and have one or many of these large instances. This often happens because it's easier for a Jenkins admin to onboard new users onto an existing controller, even if it has grown considerably. Then, when it starts to slow down, more resources get added, and the controller continues to grow. At first, everything works okay. But eventually, problems start to pile up when it gets too large.
Or perhaps, in the spirit of getting things up and running quickly, some bad practices and anti-patterns crept in: too many plugins, too many jobs, too much Groovy code, etc.
Challenges with the Monolith
There are MANY ARTICLES on this, but here are the top 4 challenges I have seen:
Poor Performance: The biggest problem you'll run into is poor performance. A restart that used to take five minutes now takes 20, and certain pages, or all pages, start taking longer to load. The time from clicking the Build button to the build actually finishing gets longer. When these issues arise, the first response is usually to scale vertically by throwing more resources at it, but that is just a band-aid. When an outage eventually occurs, many teams and users are affected, and there is a lot of pressure to get the controller back online. The bigger the controller, the longer recovery can take.
Harder to troubleshoot the cause of poor performance: The jobs that have a major effect on the controller's performance are much harder to pinpoint. A bad build can be hidden by the many other builds running at the same time. As different teams get onboarded, they bring different needs for the tools and plugins that perform their builds. This leads to a growing list of installed plugins and a fear of updating them because something might break.
Permissions: Managing permissions for all the different users can become a challenge. They can all require different levels of access and abilities across the controller.
Configuration Complexity: A Jenkins controller that is shared between several teams in an organization might end up with a large number of jobs and significant configuration complexity. As a Jenkins administrator, you might find your systems suffer from performance issues, overly complex configuration, and difficult maintenance requirements. It is also possible that teams in your organization are impacted by changes performed by other teams.
The BENEFIT of breaking up the monolith:
Scalability: Horizontally scale with many smaller controllers by shrinking the size and distributing the load. You can avoid a lot of the problems that come with monolithic controllers by doing this.
Better Performance: The controller will be faster all around. If only one team uses a controller, you'll have fewer users per controller and may be able to let them self-manage parts of it. This also means there are fewer plugins to install, audit, and update.
Uptime: If an outage occurs, you will likely have less downtime due to fewer possible causes and faster startup. That will also affect a much smaller group of users which should help alleviate some stress from the situation. (If you are using HA Mode, you may not experience downtime at all!)
Freedom: If each team has its own controller, there is no contention over plugins or plugin versions - each team has the freedom to choose the integrations that work for its own setup.
Unlimited controllers and executors: the ability to scale according to demand, not to price.
Cloud Migration: Set the stage for a cloud migration and microservices.
The CHALLENGE of breaking up the monolith
It’s not easy to split up jobs into new controllers
Harder for some industries to get approval for more infrastructure - the reality is it could take 6+ months to split up controllers and scale out
Monolith mindset (pay by executor) vs. growth mindset (pay by user, scalability, growth in teams and business). The size of your CI infrastructure should not be dictated by the terms of contracts - it should depend on the build work that serves your business goals. For example, if you have the cloud on your mind, then under Kubernetes your infrastructure is ephemeral when it needs to be, and stateful when it needs to be.
Baby Steps to a Smaller and Performant Controller (aka Best Practices)
The first step in making a performant controller is to clean up your data. Cleaning up your environment will ensure you are not migrating unused plugins and unnecessary build artifacts when you take the next step in breaking up the monolith. Fixing these issues now will save you the time of having to fix them on your new controllers. And, if you are not ready to make the architectural jump of splitting your monolith into controllers per team, you will at least have a more performant controller that will be resilient with HA Mode. So, let's get into it - here are a few baby steps to a healthier monolith.
Limit your build storage: A good starting point may be to keep only the last 10 builds for each job. Here are some references to dig deeper into this task, followed by a rough sketch of how you might audit this from the script console:
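As a minimal sketch, assuming you run it as an administrator in the Jenkins Script Console, something like this can report jobs that have no build discarder and, if you flip the illustrative dryRun flag, apply a keep-the-last-10-builds policy. The KEEP value and the dry-run guard are my own placeholders, not a CloudBees recommendation.

```groovy
// Minimal sketch: report jobs with no build discarder and, optionally,
// apply a "keep the last 10 builds" policy. KEEP and dryRun are illustrative.
import jenkins.model.Jenkins
import hudson.model.Job
import hudson.tasks.LogRotator

def KEEP = 10
def dryRun = true   // set to false to actually save the policy

Jenkins.instance.getAllItems(Job.class).each { job ->
    if (job.buildDiscarder == null) {
        println "No build discarder: ${job.fullName} (${job.builds.size()} builds on disk)"
        if (!dryRun) {
            // LogRotator(daysToKeep, numToKeep, artifactDaysToKeep, artifactNumToKeep)
            job.buildDiscarder = new LogRotator(-1, KEEP, -1, -1)
            job.save()
        }
    }
}
```

Running it with dryRun = true first gives you a report to review with the job owners before any retention policy is actually saved.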
Remove your Plugin Clutter: Determine which plugins are no longer being used so they can be removed. Most admins will disable a plugin first for a period of time to make sure it truly isn't needed before removing it. The CloudBees Plugin Analyzer will generate a weekly report that shows how often each plugin is being used and where it is being used.
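Before a proper usage report, a quick inventory helps. Here is a minimal Script Console sketch (just a listing, not a removal tool) that shows what is installed and whether each plugin is currently enabled and active:

```groovy
// Minimal sketch: inventory installed plugins so you can review candidates
// for disabling and eventual removal.
import jenkins.model.Jenkins

Jenkins.instance.pluginManager.plugins
    .toList()
    .sort { it.shortName }
    .each { p -> println "${p.shortName}:${p.version} enabled=${p.isEnabled()} active=${p.isActive()}" }
```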
Remove Inactive Items: Use the CloudBees Inactive Items plugin to find jobs that could potentially be removed. It analyzes different elements in your instance and helps you find inactive items you can safely remove. This allows you to go further with disk space optimization, as you will be able to get rid of unused and potentially legacy elements in your instance. Here is a quick video of this in action:
Disable Inactive Jobs: Look for any job that hasn’t been run in, say, 18 months, then disable it, which gives you a low-risk way of turning it off; a script console sketch of this follows below. If no one asks about it in 6 months, remove it completely.
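A minimal sketch of that idea for the Script Console. The 18-month cutoff and the dryRun flag are illustrative values you would tune for your own instance:

```groovy
// Minimal sketch: find jobs with no build in roughly the last 18 months and
// (optionally) disable them. The cutoff and dryRun flag are illustrative.
import jenkins.model.Jenkins
import hudson.model.Job

def cutoff = System.currentTimeMillis() - 18L * 30 * 24 * 60 * 60 * 1000   // ~18 months
def dryRun = true

Jenkins.instance.getAllItems(Job.class).each { job ->
    def lastBuild = job.lastBuild
    def lastRun = lastBuild ? lastBuild.getTimeInMillis() : 0L
    if (lastRun < cutoff) {
        println "Inactive: ${job.fullName} (last build: ${lastBuild?.time ?: 'never'})"
        // Freestyle and Pipeline jobs expose makeDisabled(); guard for job
        // types that do not.
        if (!dryRun && job.metaClass.respondsTo(job, 'makeDisabled')) {
            job.makeDisabled(true)
        }
    }
}
```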
Check your controllers' health: Use the Jenkins Health Advisor by CloudBees to check for potential issues. It will scan your controller and create a report of known issues, which can inform you of flooding or configuration-related problems and known bugs, as well as best practices to follow. Ensure Healthy Jenkins Controllers with Jenkins Health Advisor from CloudBees
Look for large folders: A folder with a large number of items inside can be very slow to load. You may have specific teams associated with large folders. In one case I saw, a team had a folder with 185 repositories connected to it. They had found some (unspecified) automation in GitHub that created a lot of branches, and the multibranch pipelines then pulled a lot of branch jobs onto the system.
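To spot those folders, here is a minimal Script Console sketch, assuming the CloudBees Folders plugin (which provides com.cloudbees.hudson.plugins.folder.Folder) is installed, that lists the 20 folders with the most direct children:

```groovy
// Minimal sketch: list the 20 folders with the most direct children.
// Assumes the CloudBees Folders plugin (com.cloudbees.hudson.plugins.folder.Folder)
// is installed.
import jenkins.model.Jenkins
import com.cloudbees.hudson.plugins.folder.Folder

Jenkins.instance.getAllItems(Folder.class)
    .collect { [name: it.fullName, count: it.items.size()] }
    .sort { -it.count }
    .take(20)
    .each { println "${it.count}\t${it.name}" }
```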
Garbage Collection: Because Jenkins is a Java application, most performance issues can be diagnosed by analyzing three things: garbage collection logs, thread dumps, and heap dumps. Check out this garbage collection article and this video to help walk you through it: Configuring Garbage Collection for CloudBees CI Running Java 11
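Before digging into full GC logs and dumps, a quick first look from the Script Console can show current heap usage and collector activity. This sketch uses only standard JDK management beans, nothing CloudBees-specific:

```groovy
// Minimal sketch: a quick look at current heap usage and garbage collector
// activity via standard JMX beans, before digging into GC logs and dumps.
import java.lang.management.ManagementFactory

def mb = { bytes -> String.format('%.0f MB', bytes / (1024 * 1024)) }
def heap = ManagementFactory.memoryMXBean.heapMemoryUsage
println "Heap: ${mb(heap.used)} used / ${heap.max > 0 ? mb(heap.max) : 'unbounded'} max"

ManagementFactory.garbageCollectorMXBeans.each { gc ->
    println "${gc.name}: ${gc.collectionCount} collections, ${gc.collectionTime} ms total"
}
```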
Update your plugins: Update them to the latest compatible version and test them for some time before migrating the data. Updating the plugins now will save you the time of having to update them across multiple controllers later.
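One way to see what is pending, as a minimal Script Console sketch that relies on the update center metadata the controller has already cached:

```groovy
// Minimal sketch: list plugins for which the cached update center metadata
// reports that a newer version is available.
import jenkins.model.Jenkins

Jenkins.instance.pluginManager.plugins
    .findAll { it.hasUpdate() }
    .sort { it.shortName }
    .each { p -> println "${p.shortName} ${p.version} -> update available" }
```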
I've highlighted a few steps to help clean up the monolith for better performance. The next step would be splitting it up - or it could be moving to the cloud! This is the hardest step and takes more consideration than can be covered in this article (which is already too long). When you are ready, we have guides and best practices to follow as you embark on this task.
CloudBees CI Modern Best Practices for Reliability (Moving on-prem to the cloud)
Or you could even reach out to our DevOps Consultants to talk about your specific environment and what this next step would look like.
I’ve seen a few success stories in my day, and they all started with an assessment - where are you now, and what can you do to improve the performance of your controller? I’ve had customers take the baby steps of breaking up their monoliths so they could port things over easily. It takes time to carefully plan, map out your infrastructure, and even map out your digital transformation if you are feeling particularly spry and innovative. You will come out the other side with faster release cycles, happier dev teams, happier infrastructure teams, and so much more.