Kubernetes CPU Resource Sharing in the Real World

I recently came up against this topic while trying to solve a bottleneck issue in production. While the end result was still 'throw cpu at it', in the process I learned that pod cpu resource settings, and how they translate into actual host cpu usage, are a bit more of a twisting path than I expected.

First, let's cover the basics:

# <somewhere in your deployment yaml> ...
spec:
  containers:
    - name: nginx
      image: nginx:latest
      resources:
        limits:
          cpu: 200m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi

So here we have an nginx container that is part of a pod, with a cpu request of 100m and a limit of 200m. The unit suffix 'm' means millicores, i.e. thousandths of a core, so we are asking to reserve 0.1 of a core's time (over a unit of time like a second).
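Worth noting that the same quantity can be written either in millicores or as a decimal fraction of a core; the two lines below ask for exactly the same thing:

resources:
  requests:
    cpu: 100m    # 100 millicores
    # cpu: 0.1   # the same amount, written as a fraction of a core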

Requests: Requests are guaranteed, i.e. the pod should always be able to use up to that amount of cpu in a given timeframe if it needs to, no matter what else is happening with the other pods on the node. The scheduler uses requests to decide where to spin up new pods, so that the total cpu and memory requested by the pods on a node never exceeds what that node actually has available.

Limits: Caps on how much cpu or memory a pod can use. If the pod tries to use more cpu than its limit, it gets throttled (or OOM killed if we are talking about memory), so its containers cannot do more processing work per second than the limit allows. A limit is not a guarantee of headroom, though: if the other pods are using cpu up to their requests and only 100m of un-requested cpu is left over on the node, then our pod really only has an extra 100m to use, even if its limit is set much higher.
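Under the hood, on a cgroup v1 node with default settings, these numbers roughly translate into Linux CFS settings: the request becomes the container's cpu.shares (its relative weight when cpu is contended) and the limit becomes a CFS quota per 100ms period. A rough sketch of that mapping, with the cgroup side in comments (the exact mechanics vary with cgroup version and kubelet configuration):

resources:
  requests:
    cpu: 100m    # -> cpu.shares ~= 100/1000 * 1024 ~= 102 (relative weight under contention)
  limits:
    cpu: 200m    # -> CFS quota ~= 20ms of cpu time per 100ms period (the hard throttle point)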

Let's say we have a 1 core node (nproc = 1). Whatever the operating system needs generally comes first (and can change, so watch out for that). Let's say our stable node processes use up 200m, or 1/5th of the core's time in a given second, so k8s really only has 800m left (allocatable) to schedule for pods. The scheduler can now only fit 8 of our 100m-request nginx pods on the node, looking at just cpu (we ignore memory for this discussion). If all 8 of those pods are using right at their 100m requests, then there really isn't anything left to use above that. In practice, though, many of those pods might be using less than their requests, so one or more of them has some room to spike above its request and toward its limit.
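That 800m "allocatable" number is not arbitrary: the kubelet subtracts cpu reserved for the OS and for Kubernetes components from the node's capacity before the scheduler gets to use it. A minimal sketch of the relevant kubelet configuration fields (the amounts below are made up for illustration):

# kubelet config fragment; values are illustrative only
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:    # set aside for OS daemons (systemd, sshd, etc.)
  cpu: 100m
kubeReserved:      # set aside for k8s components (kubelet, container runtime)
  cpu: 100m
# allocatable cpu ~= node capacity - systemReserved cpu - kubeReserved cpu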

So something odd about this: when two pods both try to use more than their requests, the extra available cpu is divided between them based on the ratio of their requests to each other. Let's go back to our node with 800m allocatable cpu. If we only have two pods, an nginx pod with 100m requests and 800m limits, and a mongo pod with 200m requests and 800m limits, then the total cpu limit (1600m) is twice what is potentially available (800m). If both pods spike and try to use everything they can, only 500m of the 800m is left over after requests, and the nginx pod gets half as much of that extra as the mongo pod, because its request is half as large. The mongo pod would probably settle around ~533m while the nginx pod would only get up to ~266m (100m + 1/3 of the 500m extra is about 266m, and 200m + 2/3 of it is about 533m), again assuming they both suddenly wanted all the cpu they could get, constantly.

This means that for 'spikey' pods that can burn a ton of cpu depending on the incoming job, we don't want their requests to be low relative to the other pods on the node, unless we are certain those other pods will never go above their own requests either. If pods will almost never exceed their requests, the relative cpu requests matter less. We also need to know how much extra room the node actually has for usage above requests.
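To make that concrete, here is the two-pod scenario again as resource stanzas, with the rough steady-state usage from above in comments (the names and numbers are just the ones from the example):

# nginx pod: request 100m -> relative weight 1x
resources:
  requests:
    cpu: 100m
  limits:
    cpu: 800m    # with both pods spiking, settles around ~266m

# mongo pod: request 200m -> relative weight 2x
resources:
  requests:
    cpu: 200m
  limits:
    cpu: 800m    # with both pods spiking, settles around ~533m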

So back to our prod issue. I had a bunch of pods in a deployment that ran cpu-intensive jobs; they could sometimes run up to their limit and get throttled, but most of the time would not. They were all running in the main node group on M class instances, sharing cpu time with all sorts of random deployments and jobs. Looking at the metrics it was clear that some jobs were getting throttled, and it was unclear whether there was an easy way to uncap the cpu usage of these pods without knowing what else was running on the same nodes. It could be that most pods would stay under their requests and our cpu-hungry pods would generally have plenty of cpu to use, but this workload needed that cpu to always be available without excessive cost, so I wanted to go a bit further than that.

Now in AWS most of us are aware that newer-generation instances are supposed to be somewhat faster than older ones. We also now know that requests are guaranteed, and that pods that might need a lot of cpu at a moment's notice (and that we don't want throttled) probably need a bit of isolation to have the headroom to do their job. So in this case I added a dedicated node group for the workload, consisting of C class (compute optimized) instances of the latest generation available. I also increased the cpu requests to a level that would cover most of the job runs these pods were processing, and removed the limit altogether so anything could spike as it needed. Since these pods were the only ones on the new nodes besides the typical per-node daemonsets, even if two jobs suddenly wanted much more cpu they would at least share it equally, their requests being identical. The uncapped cpu and newer instances proved to be the trick, shortening average processing times by up to half for this particular service. Case closed, for now.
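A rough sketch of what that kind of change looks like in the deployment spec (the node label, container name, and values here are hypothetical, not the real ones from that cluster):

spec:
  nodeSelector:
    workload: cpu-heavy-jobs   # hypothetical label on the dedicated C class node group
  containers:
    - name: job-runner         # hypothetical name
      resources:
        requests:
          cpu: 1500m           # sized to cover most job runs
          memory: 1Gi
        # no cpu limit here, so the pods can spike into whatever the node has free

A taint on the node group plus a matching toleration on the deployment would be another way to keep other workloads off those nodes; the nodeSelector above is just the simplest version of the idea.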