EKS Node Groups – What NOT To Do When Changing Instance Types


Whether you use Terraform or something else to define your EKS cluster configuration, you have likely run into the following scenario:

Hmm – these nodes are old, let’s swap them out for a newer instance type.

Wait... Why can’t I just change the instance type in my node group? Or the name, for that matter?

– You, at 4:30 pm on a Friday thinking you could put off a ‘simple’ upgrade until then

So I tried a few tricks in the Terraform for my node groups to speed this up. It turns out you can switch instance types on a node group IF you rename the node group AND change the instance types at the same time. Terraform will add a new node group with the new name and types, and then rip out the old one. How that swap plays out, though, is outside of Terraform’s control; it is dictated by the way the AWS API for node groups works. No matter the tooling, this will be the result. It does ‘work’, however.


In lower environments. What this process doesn’t do is play it safe when deleting the old node group, or spin up enough new instances quickly enough for the pods to move over without downtime. Have pod disruption budgets? Tough. Need to slowly kick your ingress controller pods to prevent too many lost requests? Tough. Even if you realize what is happening and quickly turn up the number of nodes in the new group, you are still likely to end up with at least a small production outage.
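For reference, the rename-and-retype change described above might look roughly like this in Terraform. This is only a sketch of the risky one-step approach, not a recommendation. Every name here (`aws_eks_cluster.main`, `aws_iam_role.node`, `var.private_subnet_ids`, the `workers-m5`/`workers-m6i` group names, and the instance types and sizes) is a made-up placeholder.

```hcl
# Sketch of the one-step swap. All names and sizes are hypothetical placeholders.
resource "aws_eks_node_group" "workers" {
  cluster_name  = aws_eks_cluster.main.name
  node_role_arn = aws_iam_role.node.arn
  subnet_ids    = var.private_subnet_ids

  # Before this change:
  #   node_group_name = "workers-m5"
  #   instance_types  = ["m5.xlarge"]
  # Both values are changed in the same apply:
  node_group_name = "workers-m6i"
  instance_types  = ["m6i.xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }
}
```

Because the old and new groups are the same Terraform resource here, there is no point in the apply where you get to pause, check that the new nodes are healthy, and move pods over at your own pace.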

There may be other ways around this, but my new plan is to create the new node group first, with enough nodes for the load, and then carefully cordon and drain the old nodes to push pods onto the new ones. You would, of course, need to cordon all the old nodes up front so that evicted pods don’t just jump from one old node to another, though generally the scheduler would put the pods on the newer, emptier nodes anyway. Once the old nodes have nothing critical left on them, the old node group can be removed in a separate PR/apply.
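A minimal sketch of that first PR, assuming the same kind of setup: the old node group stays exactly as it is and a second resource is added next to it. The resource names, group names, instance types, and sizes are all hypothetical placeholders; size the new group for your real load.

```hcl
# Hypothetical names throughout: aws_eks_cluster.main, aws_iam_role.node,
# var.private_subnet_ids, and both node group names are placeholders.

# Existing node group: left untouched in this PR.
resource "aws_eks_node_group" "workers_old" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers-m5"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m5.xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }
}

# New node group: added alongside the old one, sized to carry the full load.
resource "aws_eks_node_group" "workers_new" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers-m6i"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m6i.xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }
}
```

Once the new nodes are Ready, the old ones can be cordoned together, for example with `kubectl cordon -l eks.amazonaws.com/nodegroup=workers-m5` (this assumes managed node groups, which label their nodes with the group name), and then drained one at a time with `kubectl drain <node> --ignore-daemonsets` so pod disruption budgets are respected. The `workers_old` block is then deleted in the follow-up PR.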

I’m always amazed at how many AWS services like this feel ‘unfinished’ or have critical things in their design that leave me wondering… why?? Maybe this is another question for the TAM.