Stuck between a rock* and a hard place**?
*needing to maintain long-running, customer-critical tasks
**needing to deploy code on Kubernetes clusters that execute those very long-running tasks
As an engineer working on a unified API that handles many, many long-running tasks, I’ve run into exactly this problem. Together, our team has worked out a solution that manages the tension between customer-critical operations and code deployments on Kubernetes clusters. Maybe you’ll find the solution helpful too.
This article assumes you’re at least a little familiar with Celery, Kubernetes, and data-intensive applications. It is also in conversation with a post from our friends over at Brex, so make sure you give it a read if you want to fully understand the problem and the range of solutions!
Without further ado, let’s get into it.
Let’s break down the problem landscape.
At Merge, our backend services run on AWS EKS and use the Celery framework for scheduling asynchronous work.
Many of our tasks, such as the initial creation of a customer’s account, can take hours to run. Why? As a unified API, we’re dependent on third-party API-specific features, such as rate limits, that inhibit our ability to pull data quickly.
Now, imagine where this run time becomes an issue: while one of these long-running tasks is in progress, we need to deploy code to a Kubernetes cluster, the very cluster that is responsible for running those tasks.
It’s a lot like that two-buttons meme.
The feeling of the high-wire act of balancing different deployments with Kubernetes
Boiled down, our team found this problem is due to a combination of 3 factors, all linked to how we restart our servers:
- The requirements of the Kubernetes shutdown process. This involves: sending SIGTERM to the container, waiting a configured number of seconds (terminationGracePeriodSeconds) in the hope that the container stops itself, then sending SIGKILL to the container, forcing the end of all running processes.
- Celery’s SIGTERM behavior. On SIGTERM, the worker stops pulling in new tasks while continuing to work on tasks it has already pulled from the Celery broker (which is a global, persistent task queue). This is known as a “warm shutdown” in Celery lingo.
- The default terminationGracePeriodSeconds configuration is 30 seconds. While plenty for most tasks, it’s barely enough to run even a single API request depending on the Accounting / ATS / HRIS / Ticketing / CRM platform we are unifying.
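The way these three factors interact can be sketched as a tiny simulation. This is not Celery or Kubernetes code: `simulate_shutdown`, its parameters, and the task names are hypothetical, and the model simply assumes SIGTERM arrives at time zero, the worker goes into a warm shutdown, and SIGKILL lands once the grace period has elapsed.

```python
def simulate_shutdown(in_flight, grace_period_seconds=30.0):
    """Split in-flight tasks into those that finish and those SIGKILL cuts off.

    `in_flight` maps a task name to the seconds of work it still needs at the
    moment SIGTERM arrives. No new tasks are pulled after SIGTERM (warm
    shutdown), so only the tasks already in flight matter.
    """
    completed, killed = [], []
    for name, remaining_seconds in in_flight.items():
        if remaining_seconds <= grace_period_seconds:
            completed.append(name)  # finished before the grace period ran out
        else:
            killed.append(name)     # still running when SIGKILL arrives
    return completed, killed


# Hypothetical workload: one quick call and one multi-hour initial sync.
in_flight = {"short_api_call": 5, "initial_account_sync": 3 * 60 * 60}

# With the default 30-second grace period, the long task is killed.
print(simulate_shutdown(in_flight))
# -> (['short_api_call'], ['initial_account_sync'])

# With a (hypothetical) four-hour grace period, both tasks finish.
print(simulate_shutdown(in_flight, grace_period_seconds=4 * 60 * 60))
# -> (['short_api_call', 'initial_account_sync'], [])
```

The second call shows why a longer grace period looks tempting: the long task survives, but only if something is willing to wait that long for the pod to go away.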
While our friends at Brex wanted to keep their dependencies in sync with Elixir/Kafka, we want to keep long-running tasks moving with Python/Celery. Still, the methods we’ve used to solve our problems are largely the same.
(If it's not clear enough - I really recommend reading Brex’s article about their solution!)
The Fix: Rolling Updates
Let’s start with what doesn’t work.
A naive approach involves increasing the terminationGracePeriodSeconds setting from seconds to however long our longest tasks need. While this would work from a Celery and Kubernetes perspective, it would not play well with our continuous integration scripts, which wait along with Kubernetes for that entire duration.
As the Kubernetes docs say of the Recreate strategy we were using at the time, “successful removal is awaited before any Pod of the new revision is created.”
Rather than have our CI controlling the entire process, and waiting alongside Kubernetes, we’d rather have Kubernetes be intelligent about its own restarts and let the CI scripts hand off the responsibility to the cluster.
Thankfully, such a capability exists in the RollingUpdate deployment strategy.
See, prior to our current configuration, we had been using the Recreate deployment strategy. This caused the restart behavior that introduced our rock-and-hard-place problem. The magic of RollingUpdate is that it exposes two values that let Kubernetes handle the rollout on its own:
- maxSurge: the percentage above your normal pod count that you can have during the deployment. So if you normally have 10 pods, and configure a maxSurge of 20%, you will have up to 12 pods during the deploy.
- maxUnavailable: the percentage of your normal pod count that can be anything OTHER than READY state during the deploy. So for 10 pods, and a maxUnavailable of 20%, at least 8 will be READY because the setting stipulates that at most 2 may be non-READY. (The double negative may be confusing, but you can think of it as ~minAvailable if your name is De Morgan).
- Note that TERMINATING (the status that our long-running task pods will have) counts towards the maxUnavailable count, meaning that it is effectively the percentage of our pod count that we allow to keep running until their Celery tasks complete during the RollingUpdate.
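The arithmetic behind those two settings can be checked with a short sketch. `rollout_bounds` is a hypothetical helper, not part of any Kubernetes client. It assumes both settings are given as percentages and follows the rounding the Kubernetes documentation describes: a maxSurge percentage is rounded up to a whole pod, while a maxUnavailable percentage is rounded down.

```python
def rollout_bounds(replicas, max_surge_pct, max_unavailable_pct):
    """Return (max_total_pods, min_ready_pods) during a rolling update."""
    # Ceiling division in integer arithmetic: maxSurge rounds up.
    surge = -(-replicas * max_surge_pct // 100)
    # Floor division: maxUnavailable rounds down.
    unavailable = replicas * max_unavailable_pct // 100
    return replicas + surge, replicas - unavailable


# The example from the list above: 10 pods, 20% surge, 20% unavailable.
print(rollout_bounds(10, 20, 20))  # -> (12, 8)
```

The different rounding directions only show up when the percentage does not divide evenly: with 10 pods and 25% for both settings, the surge rounds up to 3 pods while the unavailable count rounds down to 2.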
No more sweating over two equally important goals!
With our rolling updates, we can have confidence that long-running tasks have enough time to wrap themselves up, as well as ensure that we can deploy the latest versions of our services to the necessary clusters.
Astute link clickers will note that we came to a different conclusion than Brex did. For Brex, the solution was in the maxUnavailable setting, which allowed them to guarantee all pods were running and alive, rather than be stuck in a loop of mismatched deployments. It’s worth pointing out that their Elixir libraries do not behave in the same manner as our Celery tasks. Celery’s warm shutdown thankfully matches the behavior Kubernetes expects on SIGTERM, a fortuitous alignment for Merge.
As Thomas writes in Brex’s article, “Graceful shutdown is hard to implement right, as all the pieces of the infrastructure and the software must be properly configured and implemented in harmony.”
Time will tell if we run into cross-dependency issues for other reasons though! For example, we only make backward-compatible database changes, but if there were ever a database migration that was not backward compatible, we’d be in trouble with the long-running tasks that are still finishing up on old pods.
On Using Helm (an added bonus)
Merge also uses Helm for managing our Kubernetes configurations. While it is great to work with, I found it strangely difficult to find documentation on where RollingUpdate goes in a chart.
So, here is where both settings are in our chart: