Scaling Effortlessly: How Jenkins, Karpenter and EKS Redefine CI/CD

Jenkins has served as the backbone of the CI/CD landscape for over a decade. Throughout these years, CI/CD practices have transformed from jobs executed in companies’ own data centers to those running in the cloud. Jenkins has adapted and evolved throughout this time, remaining a workhorse in the ever-changing CI/CD domain.

If you looked at a typical AWS-based Jenkins setup, you would probably see a master Jenkins node running in EC2. When a job is initiated, the master node dynamically spawns an EC2 worker instance to execute the task, then terminates the worker node upon completion. This setup saves time and money compared to the old approach of running master and worker nodes on dedicated hardware that stays up even when there are no jobs to run.

Even though running Jenkins in EC2 saves time and money, we can do better. For example, instead of spawning a whole EC2 instance for a single job, we can run each job in a Docker container. CI/CD platforms like GitHub Actions and CircleCI already work this way; however, they don’t offer the configurability and full control over worker nodes that Jenkins gives us.

We can solve this by once again evolving Jenkins: executing jobs in pods rather than EC2 instances, using tools like the Jenkins Operator and Karpenter in EKS.

Here is an example of Jenkins running in EKS with Karpenter node scaling. Let’s get started!

Prerequisites

  • A running EKS cluster and permissions to create IAM roles and policies.
  • The AWS CLI installed and configured.
  • The Kubernetes command-line tool (kubectl) installed.
  • Helm installed for deploying services.
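A quick way to confirm the tooling is in place before starting — a small sketch using a helper function of my own, not part of any of these tools:

```shell
# check_tools: print any CLIs from the argument list that are not on PATH.
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -z "$missing" ]; then
    echo "all tools found"
  else
    echo "missing:$missing"
  fi
}

check_tools aws kubectl helm
```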

Jenkins Operator

I will be installing the Jenkins Operator to manage our CI/CD environment in EKS. The operator will create and maintain our Jenkins master server, seed test jobs from a GitHub repo, and manage the worker pods that the master spawns for jobs.

I’m using Helm to install the operator with the following values.yaml file:

jenkins:
  seedJobs:
  - id: jenkins-operator
    targets: "services/jenkins/cicd/jobs/*.jenkins"
    description: "Test Jenkins Jobs"
    repositoryBranch: blog/jenkins
    repositoryUrl: https://github.com/drogerschariot/gitops-playground
  basePlugins:
  - name: kubernetes
    version: 4029.v5712230ccb_f8
  - name: workflow-job
    version: 1342.v046651d5b_dfe
  - name: workflow-aggregator
    version: 596.v8c21c963d92d
  - name: git
    version: 5.2.1
  - name: job-dsl
    version: "1.85"
  - name: configuration-as-code
    version: 1670.v564dc8b_982d0
  - name: kubernetes-credentials-provider
    version: 1.234.vf3013b_35f5b_a
  - name: prometheus
    version: 2.5.0
  enabled: true
  namespace: jenkins
  latestPlugins: true
  resources:
    limits:
      cpu: 500m
      memory: 1.5Gi
    requests:
      cpu: 250m
      memory: 1Gi
  volumes:
    - name: backup 
      persistentVolumeClaim:
        claimName: jenkins-backup
  backup:
    enabled: true
    pvc:
      enabled: true
      size: 5Gi
    resources:
      limits:
        cpu: 100m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 500Mi
    env:
      - name: BACKUP_DIR
        value: /backup
      - name: JENKINS_HOME
        value: /jenkins-home
      - name: BACKUP_COUNT
        value: "3" 
    volumeMounts:
      - name: jenkins-home
        mountPath: /jenkins-home 
      - mountPath: /backup 
        name: backup
cert-manager:
  startupapicheck:
    enabled: false
operator:
  replicaCount: 1

Let’s go through some important configurations in the values.yaml file. First, the git repository, branch, and path containing the DSL seed-job code the Jenkins Operator will use to sync pipelines. I will cover the seed-job process later:

seedJobs:
- id: jenkins-operator
  targets: "services/jenkins/cicd/jobs/*.jenkins"
  description: "Test Jenkins Jobs"
  repositoryBranch: blog/jenkins
  repositoryUrl: https://github.com/drogerschariot/gitops-playground

Here is the resource definition for the Jenkins master instance, along with where backups will be saved:

resources:
  limits:
    cpu: 500m
    memory: 1.5Gi
  requests:
    cpu: 250m
    memory: 1Gi
volumes:
  - name: backup 
    persistentVolumeClaim:
      claimName: jenkins-backup

Installing the Operator

Run the following to install the operator and use the values.yaml mentioned above:

$ kubectl create namespace jenkins 
$ helm repo add jenkins https://raw.githubusercontent.com/jenkinsci/kubernetes-operator/master/chart 
$ helm install jenkins jenkins/jenkins-operator -n jenkins --values values.yaml 

1. Watch the Jenkins instance being created:

 
$ kubectl --namespace jenkins get pods -w 

2. Get Jenkins credentials:

$ kubectl --namespace jenkins get secret jenkins-operator-credentials-jenkins -o 'jsonpath={.data.user}' | base64 -d 
$ kubectl --namespace jenkins get secret jenkins-operator-credentials-jenkins -o 'jsonpath={.data.password}' | base64 -d 

3. Port forward to the Jenkins master running in the cluster:

 
$ kubectl --namespace jenkins port-forward jenkins-jenkins 8080:8080

Now just browse to http://localhost:8080 and use the credentials from the command above.

Seeding jobs

The seeding process in the DSL Jenkins plugin involves creating and managing Jenkins jobs using Groovy scripts, often referred to as Jenkins Job DSL scripts. These scripts live as IaC, usually in a git repo. By following this process, you can use automation to manage and maintain Jenkins job configurations efficiently, especially in environments with many jobs or frequent changes.
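As a sketch of what such a Job DSL script looks like — the job name and Jenkinsfile path below are hypothetical, not taken from the repo used in this post:

```groovy
// Hypothetical Job DSL seed script: defines a pipeline job whose
// Jenkinsfile lives in source control.
pipelineJob('example-build') {
    description('Pipeline created and kept in sync by the seed job')
    definition {
        cpsScm {
            scm {
                git {
                    remote { url('https://github.com/drogerschariot/gitops-playground') }
                    branch('blog/jenkins')
                }
            }
            scriptPath('services/example/Jenkinsfile') // hypothetical path
        }
    }
}
```

Re-running the seed job replays scripts like this against the Jenkins master, creating, updating, or pruning jobs to match what is in git.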

For my tests, I have 4 jobs located at https://github.com/drogerschariot/gitops-playground/blob/blog/jenkins/services/jenkins/cicd/jobs/k8s_jobs.jenkins. When you run the seed job, it syncs with the repo and applies any changes.

Running Jobs

When a job is started, the Jenkins Operator will start a pod, defined by the podTemplate() in the job DSL. Let’s focus on the podTemplate() function:

 
podTemplate(
    label: label,
    containers: [
        containerTemplate(
            name: 'build-npm', 
            image: 'alpine:3.11', 
            ttyEnabled: true,
            resourceLimitCpu: '500m',
            resourceLimitMemory: '500Mi',
            resourceRequestCpu: '250m',
            resourceRequestMemory: '250Mi'
        )
    ],
)

This template defines the pod for the “build-npm” job. The benefit of this is that every Jenkins pipeline can have resources and a Docker image specifically tailored to the task.

Here I run 3 build-maven and 3 build-npm pipelines; notice the pods running compared to the executors.


The controller reads the pod template, then creates the pod for the executor to connect to. When the job is done, the controller removes the pod. This drastically improves pipeline speed compared to starting an EC2 instance per task.

However, what if there are no resources left and you start seeing the dreaded “Pending” status because no physical nodes are available? EKS has two ways of scaling nodes, and one of them works perfectly with the Jenkins Operator.

Karpenter

In the past, AWS recommended using the Cluster Autoscaler for node scaling in EKS. The autoscaler is a pod inside EKS that monitors pending pods, then updates the managed node group’s ASG to scale up or down. This works fine; however, you are tied to the ASG you are scaling and have little control over how scaling happens in different scenarios.

Now AWS recommends Karpenter for just-in-time node scaling. Karpenter provides the NodePool CRD, so we can define exactly what and how we want physical nodes to scale without being tied to an AWS Auto Scaling group.

Using Karpenter with the Jenkins Operator gives us even more options to speed up pipelines and save money. Let’s look at using Karpenter vs. the cluster autoscaler.

IAM Permissions

Note: I won’t go into the permissions needed for the Cluster Autoscaler; however, they are very similar to Karpenter’s.

Karpenter uses EKS Pod Identity and a Kubernetes service account to create EC2 instances and add them to EKS, so we need an IAM role with the following trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}

and policy:

{
  "Statement": [
      {
          "Action": [
              "ssm:GetParameter",
              "ec2:DescribeImages",
              "ec2:RunInstances",
              "ec2:DescribeSubnets",
              "ec2:DescribeSecurityGroups",
              "ec2:DescribeLaunchTemplates",
              "ec2:DescribeInstances",
              "ec2:DescribeInstanceTypes",
              "ec2:DescribeInstanceTypeOfferings",
              "ec2:DescribeAvailabilityZones",
              "ec2:DeleteLaunchTemplate",
              "ec2:CreateTags",
              "ec2:CreateLaunchTemplate",
              "ec2:CreateFleet",
              "ec2:DescribeSpotPriceHistory",
              "iam:GetInstanceProfile",
              "iam:CreateInstanceProfile",
              "iam:TagInstanceProfile",
              "iam:AddRoleToInstanceProfile",
              "iam:PassRole",
              "pricing:GetProducts"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "Karpenter"
      },
      {
        "Sid": "AllowInterruptionQueueActions",
        "Effect": "Allow",
        "Resource": "${aws_sqs_queue.karpenter_queue.arn}",
        "Action": [
          "sqs:DeleteMessage",
          "sqs:GetQueueUrl",
          "sqs:ReceiveMessage"
        ]
      },
      {
          "Action": "ec2:TerminateInstances",
          "Condition": {
              "StringLike": {
                  "ec2:ResourceTag/karpenter.sh/nodepool": "*"
              }
          },
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "ConditionalEC2Termination"
      },
      {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::<AWS_ACCOUNT>:role/KarpenterNodeRole-<EKS_CLUSTER_NAME>",
          "Sid": "PassNodeIAMRole"
      },
      {
          "Effect": "Allow",
          "Action": "eks:DescribeCluster",
          "Resource": "arn:aws:eks:${var.region}:<AWS_ACCOUNT>:cluster/<EKS_CLUSTER_NAME>",
          "Sid": "EKSClusterEndpointLookup"
      }
  ],
  "Version": "2012-10-17"
}
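With both documents saved locally, the role can be created and the policy attached via the AWS CLI. The role, policy, and file names below are placeholders of my choosing:

```
$ aws iam create-role --role-name KarpenterControllerRole \
    --assume-role-policy-document file://karpenter-trust.json
$ aws iam put-role-policy --role-name KarpenterControllerRole \
    --policy-name KarpenterControllerPolicy \
    --policy-document file://karpenter-policy.json
```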

When we attach this role to a Pod Identity association, the Karpenter operator will have access to create EC2 instances and add them as worker nodes in EKS:
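The association itself can be created with the AWS CLI; the cluster and role names here are placeholders:

```
$ aws eks create-pod-identity-association \
    --cluster-name my-eks-cluster \
    --namespace kube-system \
    --service-account karpenter \
    --role-arn arn:aws:iam::<AWS_ACCOUNT>:role/KarpenterControllerRole
```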

Install Karpenter

We will use Helm to install the Karpenter Operator:

$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "0.35.0" --namespace "kube-system" \
--set "settings.clusterName=my-eks-cluster" \
--set "settings.interruptionQueue=my-eks-cluster" \
--set controller.resources.requests.cpu=250m \
--set controller.resources.requests.memory=256Mi \
--set controller.resources.limits.cpu=500m \
--set controller.resources.limits.memory=512Mi \
--wait

We see the operator running:

$ kubectl get pods -l "app.kubernetes.io/name=karpenter" -n kube-system
NAME                         READY   STATUS    RESTARTS   AGE
karpenter-84749cc94f-qsxn5   1/1     Running   0          5m21s
karpenter-84749cc94f-xczdw   1/1     Running   0          5m21s

The Karpenter operator will install the NodePool and EC2NodeClass CRDs. Here is an example of 3 NodePool configs using one EC2NodeClass:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: small
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t4g.nano", "t4g.micro", "t4g.small", "t4g.medium"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
      nodeClassRef:
        name: default
  limits:
    cpu: 250
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: large
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t4g.xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
      nodeClassRef:
        name: default
  limits:
    cpu: 250
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t4g.xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default
  limits:
    cpu: 250
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h

---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "karpenter-node-role"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-eks-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-eks-cluster

Let’s go through each NodePool:

  1. NodePool small:
    • Only use t4g.nano, t4g.micro, t4g.small, t4g.medium EC2 types
    • Only use spot instances
  2. NodePool large:
    • Only use t4g.xlarge EC2 types
    • Only use spot instances
  3. NodePool on-demand:
    • Only use t4g.xlarge EC2 types
    • Only use on-demand instances
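Apply the manifests and watch Karpenter register capacity — the filename here is mine:

```
$ kubectl apply -f nodepools.yaml
$ kubectl get nodepools
$ kubectl get nodeclaims -w
```

The NodeClaim objects show each node Karpenter provisions against a NodePool, which is handy for watching scale-up while jobs queue.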

Now we can choose which NodePool our Jenkins jobs will use when we add the nodeSelector: 'karpenter.sh/capacity-type=spot' property to our PodTemplate.

 
podTemplate(
    label: label,
    nodeSelector: 'karpenter.sh/capacity-type=spot', // Matching label for NodePool
    containers: [
        containerTemplate(
            name: 'build-npm', 
            image: 'alpine:3.11', 
            ttyEnabled: true,
            resourceLimitCpu: '500m',
            resourceLimitMemory: '500Mi',
            resourceRequestCpu: '250m',
            resourceRequestMemory: '250Mi'
        )
    ],
)

To test this, I will run 4 pipelines, all at the same time:

  1. build-npm: requires 500m CPU, 500Mi memory, and spot instances; run 15 times.
  2. build-maven: requires 1000m CPU, 1Gi memory, and spot instances; run 15 times.
  3. build-npm-large: requires 4000m CPU, 2500Mi memory, and on-demand instances; run 5 times.
  4. build-maven-large: requires 4000m CPU, 2000Mi memory, and on-demand instances; run 5 times.

Here are the results when using the Cluster Autoscaler (with time sped up):

It took 5 minutes 40 seconds to run all 30 pipelines. The Cluster Autoscaler spun up 8 t4g.xlarge instances, and it took 17 minutes before all the added nodes were shut down again. That works out to $0.1692 for the whole test run.

Now here is the same test using Karpenter:

It took 3 minutes 56 seconds to run all 30 pipelines. Karpenter spun up 8 t4g.xlarge and 6 t4g.small instances, and all the added nodes were shut down within 6 minutes, for a total cost of $0.05424. Even though Karpenter spun up 6 more instances, it used smaller, cheaper instances for most of the tests and shut them down sooner than the Cluster Autoscaler did, cutting the cost by roughly 70%.
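The cost figures above come down to simple arithmetic: instance count × hourly rate × runtime. Here is a sketch of that formula; the helper and the rate below are made-up examples of mine, not actual AWS prices:

```shell
# node_cost COUNT HOURLY_RATE MINUTES -> dollar cost of COUNT identical nodes
node_cost() {
  awk -v c="$1" -v r="$2" -v m="$3" 'BEGIN { printf "%.5f\n", c * r * m / 60 }'
}

# e.g. 8 nodes at a hypothetical $0.06/hour for 17 minutes:
node_cost 8 0.06 17   # prints 0.13600
```

Plug in your region’s actual spot and on-demand rates to reproduce a comparison like the one above.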

Conclusion

Jenkins has significantly evolved from its early days as a standalone CI/CD tool to become a more robust, scalable, and efficient solution when running in a managed Kubernetes environment like EKS. The Jenkins Operator simplifies Jenkins’ deployment and management in Kubernetes environments and takes advantage of dynamic scaling capabilities.

Integrating Karpenter with Jenkins introduces an innovative approach to resource management, optimizing the use of computing resources and reducing costs. Karpenter’s ability to automatically adjust resource allocation based on workload demands ensures that Jenkins runs more efficiently, providing a seamless CI/CD pipeline that is both cost-effective and time-efficient.