POC Time – Handling Graceful Pod Shutdown in Kubernetes For Pods That Process Data

Here’s the scenario: let’s say we have a deployment in our Kubernetes cluster for pods that do some sort of important processing work. They might be pulling messages from a queue and then processing uploaded files from your customers. If the processing task is quick and there is a way to retry when a job gets interrupted, then it probably isn’t a big deal to you if a pod gets moved around by the autoscaler or by a rolling deployment update. However, if this is something customer facing that takes a while to complete, you probably don’t want visitors to your website to get frustrated when they click a button, wait a minute for the result, and end up with an error message asking them to retry their task.

When you ‘delete’ a pod, or a pod is shut down due to a scaling event, or you roll out a new deployment, Kubernetes sends a SIGTERM to the pod. This signal doesn’t forcibly shut down the pod, but after a grace period, defined by the field ‘terminationGracePeriodSeconds’ (which defaults to 30 seconds if you haven’t set it), a SIGKILL is sent to the pod, which forcibly stops its running processes.

...<snip>...
spec:
  containers:
    - command:
        - python
        - my_program.py
      image: bitnami/python:3.9
      name: my_program
  terminationGracePeriodSeconds: 5  # wait after SIGTERM before killing the pod

So in this simple YAML snippet above from a deployment, the pod runs a Python file as its ‘work’, and when you delete the pod Kubernetes waits up to 5 seconds after the SIGTERM before forcibly killing the pod’s processes. The pod doesn’t know any of this is happening because we haven’t built it into our code; we are just running some Python to process jobs. Now this grace period setting may be useful in our case above, as perhaps we know each job we process with our pods takes approximately one minute. So we might set terminationGracePeriodSeconds: 70 or so to allow a job to finish fully before the pod is terminated. But this assumes that our pod won’t pick up a new job after the first one finishes, and that is probably not the case, right? We don’t want any running job to be interrupted, so setting a grace period isn’t enough here, though it is part of the solution to our problem.
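
To make that concrete, here is a rough sketch of what a naive my_program.py might look like with no awareness of SIGTERM at all. This is my own assumption about the code, not something taken from the deployment above, and the job-pulling details are invented for illustration.

 # Hypothetical sketch of a naive my_program.py with no SIGTERM handling
 
 import time
 
 def process_next_job():
     # stand-in for pulling a message from a queue and processing a file
     print("Pulling a new job to process")
     time.sleep(60)  # simulate roughly one minute of work
 
 while True:
     # nothing here ever checks whether a shutdown was requested, so a SIGTERM
     # kills the process mid-job (Python's default behaviour), and even if we
     # ignored the signal, the pod would be SIGKILLed once the grace period ends
     process_next_job()

Either way, whatever job was in flight when the pod was told to shut down is lost.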

If you look up ‘SIGTERM’ you will find that this signal can be intercepted and handled by an application. The application can be built in such a way that it knows a SIGTERM has been sent to it and can react accordingly. So in theory it should be possible to create a pod that ‘knows’ it has been asked to shut down, and can make sure that whatever it is currently processing is the last thing it completes before exiting on its own. What might this look like for, say, a simple Python application? Let’s take a look at an example that you might run inside your pod’s container image.

 # Python example code you might be running that is aware of a SIGTERM
 
 import time
 import threading
 import signal
 
 # Global flag indicates if a shutdown signal has been received
 shutdown_flag = threading.Event()
 
 # once bound to an incoming signal like sigterm, this sets the flag
 # indicating that a shutdown request has been received
 def sigterm_handler(signum, frame):
     # Signal the application to shutdown
     shutdown_flag.set()
     print("Shutdown request was received from k8s")
 
 # Bind the SIGTERM signal to the signal_handler function
 # In practice this means that a sigterm will automatically run the sigterm_handler function while your other code is running
 signal.signal(signal.SIGTERM, sigterm_handler)
 
 def main():
     # Main loop - this would be the part of your code that pulls
     # and processes jobs. The idea here is that it checks to make sure
     # no shutdown event has been received BEFORE it tries to process any job
     while not shutdown_flag.is_set():
         try:
             print("Pulling a new job to process")
             time.sleep(60)  # Simulate work with sleep
         except KeyboardInterrupt:
             # Allow for graceful exit on Ctrl+C, good when local testing :)
             shutdown_flag.set()
 
     # Perform any cleanup after loop has exited
     print("Cleaning up before shutdown")
     time.sleep(2)  # Simulate cleanup tasks
     print("Shutting down...")
 
 if __name__ == "__main__":
     main()

So here what we have done is make the program running in the pod aware of a SIGTERM event. This means that not only is there a graceful shutdown period, but when a shutdown is requested the running process will finish the job it is working on, skip picking up any new ones, and then exit on its own. As long as the time to finish a job (plus any cleanup) is less than our terminationGracePeriodSeconds setting, we can rest easy that a deployment rollout or other event won’t interrupt the jobs running in those pods. In our case, that means a customer using our website has a much smaller chance of hitting that pesky error that forces them to retry their file upload (or whatever else they were doing that required time to complete). Fewer support cases and a better end user experience win the day.
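
If you want to convince yourself of this behaviour before it ever reaches a cluster, one option is a quick local sanity check, sketched below under the assumption that the worker above is saved as my_program.py: run it as a subprocess and send it the same SIGTERM the kubelet would.

 # Local test sketch: run the worker above as a subprocess, send it SIGTERM
 # part-way through a "job", and watch it finish the job and its cleanup
 # before exiting on its own.
 
 import signal
 import subprocess
 import time
 
 proc = subprocess.Popen(["python", "my_program.py"])  # the worker shown above
 time.sleep(5)                      # let it start "processing" a job
 proc.send_signal(signal.SIGTERM)   # what the kubelet sends at pod shutdown
 proc.wait()                        # blocks until the simulated job and cleanup finish
 print("worker exited with code", proc.returncode)

You should see the ‘Shutdown request was received from k8s’ message almost immediately, but the script only exits about a minute after it started, because the simulated 60-second job is allowed to run to completion before the cleanup happens.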

On the DevOps/SRE side you might not be directly responsible for making code changes like this to make a service more reliable, but knowing what is possible and passing it on to the right people is half the battle.