Our Story and Humble Beginnings

Taboola, a company with more than 500 million daily active users, was an early adopter of Kubernetes at massive scale. The company originally ran its Kubernetes clusters as systemd-based services, deployed with Puppet as the configuration management tool. However, as Kubernetes releases became more frequent, we found it challenging to keep up with the latest updates and new features.

Before Kubernetes, we ran our containerized applications on Nomad clusters. However, we quickly realized we needed a more agile and scalable infrastructure, so we turned to Kubernetes and used the “Kubernetes The Hard Way” project, written by Google’s Kelsey Hightower, to initiate our clusters.

Scaling Up Infrastructure: Challenges with Upgrades and Modifications

As the company’s infrastructure grew, we found modifying and upgrading it with Puppet increasingly difficult. The infrastructure consisted of seven large-scale Kubernetes clusters across nine on-premises data centers, spanning tens of thousands of cores. Puppet works by converging the infrastructure toward a declared desired state, and it could not apply gradual, rolling modifications to a cluster without causing downtime, which created challenges for the platform team and the service level it provides.

Consequently, the team accumulated significant technical debt, and we could not keep up with the fast-paced Kubernetes release cycle. We were running an outdated version of Kubernetes while newer versions were already available.

The Kubernetes release cycle itself has also changed: in April 2021, the open-source Kubernetes release team shifted from four to three releases per year, making the release schedule more predictable and improving release quality.

Despite this change, it’s still crucial for companies to keep up with the latest updates and new features in Kubernetes, as it’s a rapidly evolving technology that requires continuous learning and adaptation.

Revolutionizing Infrastructure Management: Our Journey from the Hard Way to Kubeadm

To overcome these challenges, we designed a solution to allow us to initiate and re-initiate clusters as quickly as possible.

We moved away from Puppet and adopted a more agile and scalable infrastructure management approach.

Kubeadm, a tool for initiating and managing Kubernetes clusters, had become stable and was seeing growing adoption. The team created a new development cluster, separate from the production infrastructure, for testing and experimentation. This approach allowed us to build, break, and rebuild faster, and it helped us catch up with the latest version of Kubernetes.

Upgrades Management Process

After we restructured our infrastructure, upgrades and maintenance work became vastly more manageable. We created a process to apply upgrades to our clusters with zero downtime for production services. We divided the process into three steps:

  • First, we gather and decide on all the new software versions that make up our infrastructure.
  • Second, we build and test the upgrade process with our development cluster.
  • Third, we schedule and upgrade our production clusters.

The first step is to plan the upgrade process; it is more of an engineering job than an execution job. We list all the infrastructure applications that make up our Kubernetes platform and check the next stable or LTS version for each. We never take the latest and greatest, because we prioritize stability over features. We check the compatibility of those applications with the new API versions, and we also check which versions are widely adopted by the large cloud providers.

In every Kubernetes version update, we must also upgrade essential infrastructure applications such as the Calico CNI, Rook, Istio, and more, which makes the upgrade process challenging to plan.

We test every procedure we plan on the development cluster. We pick versions and then simulate the entire process, which gives us a good idea of what will happen in production. Usually, we need to tune versions due to incompatibilities, or some applications have already released new minor versions by the time we simulate. We run the process dozens of times until we perfect the technique, and only then do we schedule a maintenance day to apply it to production. We call it the “Kung Fu” method, and it works.

Automated Upgrade Service

Once we start an upgrade, we actively operate and monitor the control-plane nodes, as they are inherently more sensitive than regular worker nodes. For the rest of the cluster, we automated the worker-node upgrade process with a service we wrote in Golang.

We designed the service to manage the entire process from start to finish, with the following goals in mind:

a. Run the upgrade process in the background without human intervention.

b. Zero downtime for the services that run on the platform.

The service continuously monitors the cluster’s health and resources. It pauses if there are insufficient resources to operate the cluster or if cluster health degrades.

We wanted the service to behave like the rolling-deployment mechanism of Kubernetes: when the upgrade gets stuck or the cluster is unhealthy, it stops, and when the cluster returns to a healthy state, it resumes. As with rolling deployments, the service can be configured with options like “max surge” and “max unavailable”. It can also be configured to upgrade one or multiple nodes simultaneously, and with the number of nodes that can be operated on at once, the number of failed nodes that halts the entire upgrade, the waiting time between operations, and more.
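To make those knobs concrete, here is a minimal sketch of what such a configuration could look like in Go. It is illustrative only; the field names are hypothetical, as the service’s actual API is internal.

```go
package upgrader

import "time"

// UpgradeConfig mirrors the rolling-deployment-style options described
// above. All field names are illustrative, not the service's real API.
type UpgradeConfig struct {
	MaxSurge       int           // extra nodes allowed during the upgrade, as in rolling deployments
	MaxUnavailable int           // nodes that may be out of the cluster at any moment
	Parallelism    int           // nodes operated on simultaneously
	MaxFailures    int           // failed nodes that halt the entire upgrade process
	WaitBetweenOps time.Duration // cool-down between node operations
	TargetVersion  string        // Kubernetes version to upgrade the nodes to
}
```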

The service has two major components: a controller and an executor. The controller monitors the cluster, decides which nodes to upgrade, marks and drains them, and initiates a single executor for each marked node.
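As an illustration of that flow, the cordon-and-drain step might look roughly like this with client-go. This is a simplified sketch: the real controller also honors concurrency limits, skips DaemonSet pods, and tracks cluster health.

```go
package upgrader

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markAndDrain cordons a node and evicts its pods so an executor can
// upgrade it safely.
func markAndDrain(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so no new pods land on it.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := cs.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}

	// Evict every pod on the node; the Eviction API respects
	// PodDisruptionBudgets, which is what keeps services at zero downtime.
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
		}
		if err := cs.PolicyV1().Evictions(p.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evict %s/%s: %w", p.Namespace, p.Name, err)
		}
	}
	return nil
}
```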

The executor is a separate unit from the controller and can be swapped out to perform different functions. We used the Ansible configuration management tool for the upgrade process, since it requires running Linux commands and kubeadm commands. The executor is a Pod launched by the controller via a Job manifest; the Pod’s image contains Ansible installed on a tiny Linux distribution, and its entrypoint runs an Ansible playbook that upgrades the node to the newer version.

The executor can be used dynamically to run different playbooks on demand. We used a Kubernetes ConfigMap, mounted as a volume, to pass the playbook file into the container. The executor is scheduled on one of the control-plane nodes to avoid landing on a node that is in the process of being upgraded. If the executor fails, the Job reschedules the Pod until it succeeds or exhausts all attempts.
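A minimal client-go sketch of how such a Job could be built is shown below; the image name, ConfigMap name, namespace, and playbook path are placeholders, not the actual values we use.

```go
package upgrader

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// launchExecutor creates the Job that runs the Ansible playbook against
// one node. Names and paths here are illustrative placeholders.
func launchExecutor(ctx context.Context, cs kubernetes.Interface, node string) error {
	backoff := int32(4) // reschedule the pod on failure until attempts run out
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "upgrade-" + node, Namespace: "kube-system"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					// Pin the executor to a control-plane node so it never
					// runs on the node it is upgrading.
					NodeSelector: map[string]string{"node-role.kubernetes.io/control-plane": ""},
					Tolerations: []corev1.Toleration{{
						Key:      "node-role.kubernetes.io/control-plane",
						Operator: corev1.TolerationOpExists,
					}},
					// The playbook arrives via a ConfigMap mounted as a volume.
					Volumes: []corev1.Volume{{
						Name: "playbook",
						VolumeSource: corev1.VolumeSource{
							ConfigMap: &corev1.ConfigMapVolumeSource{
								LocalObjectReference: corev1.LocalObjectReference{Name: "upgrade-playbook"},
							},
						},
					}},
					Containers: []corev1.Container{{
						Name:  "executor",
						Image: "registry.example.com/ansible-executor:latest", // tiny Linux + Ansible
						Args:  []string{"ansible-playbook", "/playbooks/upgrade.yml", "-l", node},
						VolumeMounts: []corev1.VolumeMount{{
							Name: "playbook", MountPath: "/playbooks",
						}},
					}},
				},
			},
		},
	}
	_, err := cs.BatchV1().Jobs(job.Namespace).Create(ctx, job, metav1.CreateOptions{})
	return err
}
```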

The upgrade playbook covers all operations needed to upgrade a node, including installing new packages, modifying configurations, and restarting appropriate services.

After the executor finishes upgrading a node, the controller runs tests to ensure proper functionality. If they pass, the node is returned to the cluster. The controller continues upgrading nodes until the cluster becomes unhealthy or all nodes are successfully upgraded. Because the executor can be replaced with other executables, the service is useful for additional operations as well.
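As a rough sketch of that verification step, the check-and-return logic could look like this; the real post-upgrade tests are more thorough, and nodeUpgraded here is a hypothetical stand-in for them.

```go
package upgrader

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// nodeUpgraded reports whether a node came back Ready on the target
// kubelet version; a stand-in for the controller's real post-upgrade tests.
func nodeUpgraded(ctx context.Context, cs kubernetes.Interface, name, wantVersion string) (bool, error) {
	node, err := cs.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if node.Status.NodeInfo.KubeletVersion != wantVersion {
		return false, nil
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			return true, nil
		}
	}
	return false, nil
}

// uncordon returns a verified node to service.
func uncordon(ctx context.Context, cs kubernetes.Interface, name string) error {
	patch := []byte(`{"spec":{"unschedulable":false}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, name,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```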

Automated Upgrades for Error-Free Large-Scale Deployments

One benefit of an automated upgrade service is the reduced likelihood of human error, which is especially important in large-scale deployments like Taboola’s. With an automated process, the team can be confident that each node is upgraded correctly and that the process is consistent across all nodes.

Another benefit is that the team saves time and resources. Without the automated service, upgrading worker nodes would require manual labor, which is time-consuming and error-prone. With the service, the team can focus on other tasks while the upgrade runs.

Danger, Robot in Action

Implementing an automated upgrade service has its challenges. One challenge is that the service must be carefully designed and tested to ensure it works reliably and efficiently. The team needs to consider various factors, such as the cluster’s size, the available resources, and the workloads running on the cluster.

The service serves us well today, and we use it often. For example, we used it for a “Docker sunset” project, which aimed to replace the Docker runtime with the Containerd runtime on our production clusters. Initially, we installed Containerd only on new nodes joining the cluster, but after Docker was deprecated as a container runtime in the Kubernetes ecosystem, the existing nodes had to be converted as well. We used our service to speed up the shift to the new runtime, with an executor that applies a reinstall function to turn each node into a Containerd machine. The result was a faster and more controlled process.

Bottom line: by using Kubeadm and creating a separate development cluster, we overcame the challenge of keeping up with the fast-paced Kubernetes release cycle. Adopting a more agile and scalable infrastructure management approach is crucial for companies that use Kubernetes at massive scale, like Taboola. With our upgrade processes in place, we can keep everything up to date and promote new features for our R&D teams.
