Ingress Nginx Zero Downtime Configuration for AWS


I’m not clear on why this took me a few days to figure out. I suppose I was dealing with other requests as they came up, as is the job.

Now, I wasn’t the person who installed most of our EKS ingresses, and I hadn’t taken much time to see how they were set up, or even to understand the inner workings of the controller itself. I just knew they ‘worked’, and that I had a ton of other things to think about. To be honest, adding a new ingress has mostly been copypasta plus changing a few items in the values file.

So we had decided that we needed to get away from non-JSON logging for our ingresses. This made sense: whatever logging solution we end up on, JSON will be much easier to deal with. The regex needed to parse the default access log line from the out-of-the-box ingress-nginx configuration is just... gross. One thing off anywhere and the log line won’t get parsed at all.
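For reference, the switch boils down to two ConfigMap keys in the chart values. `log-format-escape-json` and `log-format-upstream` are real ingress-nginx options, but the exact field list below is an illustrative sketch, not our production format:

```yaml
# values.yaml (ingress-nginx Helm chart) -- sketch only
controller:
  config:
    # Escape variable values so every log line is valid JSON
    log-format-escape-json: "true"
    # Emit each access log line as a JSON object instead of the
    # default combined-style line that needs a fragile regex
    log-format-upstream: >-
      {"time":"$time_iso8601","remote_addr":"$remote_addr",
      "request":"$request","status":$status,
      "request_time":$request_time,"upstream_addr":"$upstream_addr"}
```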

With that issue now fixed in our charts, I proceeded to slowly update our ingress controller deployments, all ~20 of them. I had done this before for other reasons, of course, so I didn’t think much about it. After a while I got a message on Slack: some critical service in one of our production clusters had been inaccessible for a full minute. Huh? Sure enough, I went back and looked at the timing, and it lined up exactly with one of our ingress controller deployments rolling. Cause found, but this wasn’t cool. With HPA scaling controller pods up and down, this would mean that randomly during the day some end users might be getting errors. The SRE hat came out and I started to dig into the problem.

So I went back to one of the lower environments and set up a simple bash one-liner that curled an endpoint every second, showing the status code and response time:

```shell
# Poll the endpoint once per second, printing status code and latency
while :; do
  curl -o /dev/null -s -w "Status Code: %{http_code}, Time: %{time_total}\n" [URL]
  sleep 1
done
```

So now I was ‘monitoring’. I looked for the ingress this endpoint was on and found its deployment; a rollout restart should be a good way to reproduce the issue. As the pods rolled, I saw the endpoint fail a few times, and a few other times it just took way too long to respond. Yeah, not good. This had to be fixed.

Now, digging around, the docs for the ingress-nginx controller state somewhere that yes, you will get some request failures on an upgrade or restart of the deployment. What I don’t see is any kind of example of how to work around this, or charts that other folks have made to make this service bulletproof in AWS behind, say, a Classic Load Balancer. Seems to me the point of K8s and an ingress controller is to be highly available in the first place, right? (Yes, I should probably PR something into the docs to help out here.)

I spent many hours trying various things, and even went so far as to dig into the Kubernetes Slack group. Turns out there are a number of things that can go wrong if you just use the ingress-nginx chart with mostly default options.

  • ExternalTrafficPolicy – If you have this set to ‘Local’, it will cause reliability issues when a pod gets terminated. The reason is that k8s will shut off the nodeport for your controller pod the second it gets the termination request, which doesn’t allow time for any sort of graceful removal of the target in the LB. There is a feature gate, ‘ProxyTerminatingEndpoints’, that is supposed to allow traffic to continue flowing to terminating pods on a node, but I haven’t seen it work (it’s alpha on my cluster version), and I might not want the entire cluster’s behavior changed here anyway. For now I’m giving up and using the ‘Cluster’ option, even though I don’t want the extra hop; I was able to verify that in my case any additional latency is small enough to be unmeasurable. Even if you use tagging at the LB level to target just the nodes your pods are on and think you are avoiding the extra hops, you aren’t: the ‘Cluster’ traffic policy will spread requests around rather than preferring the local pod. The downside is that client_ip gets obscured, so you may need proxy protocol to preserve it, and in some cases that may not be possible. Testing will be required.
  • A pod may sometimes spin up quickly enough that the health check endpoint at /healthz is responding while the main process, with all its Lua and other goodness, isn’t really ready to serve requests at a normal clip. The load balancer might add your node too quickly, and traffic could be extra slow or bounce. Setting controller.minReadySeconds: 30 is recommended by some folks running this in AWS (and elsewhere) to prevent this.
  • There is a built-in pre-stop hook for the controller that allows it to terminate in a more graceful fashion. Setting the controller.extraArgs.shutdown-grace-period value will add a pause to the shutdown process. The controller will actually stop responding first on its /healthz endpoint while continuing to process requests, giving the cluster plenty of time to stop sending new requests to your pod. The value will depend on what you think the longest lived connection will be, which for us is perhaps 60 sec. I would set this to a higher value than the draining for the LB below, maybe 90 in our case.
  • Looking at the LoadBalancer now, I see that connection draining isn’t enabled by default. However, we aren’t using targets in a simple way anymore, due to the ‘Cluster’ policy above. There isn’t much we can do to improve the process if a node suddenly fails, but in theory we would always want our LB to drain based on our longest-lived connection anyway, so I will set the following annotations even though they won’t make much of a difference in most cases:
    • service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: “true”
    • service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: “60”
  • Let’s look a bit more closely at the LB – I notice that cross-zone load balancing isn’t enabled by default either... What? My deployments don’t know AZs exist, so pods get assigned to AZs at random. It’s possible for some requests to fail purely due to the random distribution of pods and nodes, depending on cluster size and number of replicas. SMH. OK, let’s add another annotation to controller.service.annotations to prevent this possibility:
    • service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: “true”
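Putting the settings above together, here’s roughly the relevant slice of the Helm values. Treat this as a sketch: the key paths come from the upstream ingress-nginx chart, and the timeout numbers are the ones discussed above, not universal recommendations:

```yaml
# values.yaml (ingress-nginx Helm chart) -- sketch, tune the numbers
controller:
  # Wait before counting a fresh pod as available, so the LB doesn't
  # send traffic to a pod whose /healthz is up but nginx isn't warm yet
  minReadySeconds: 30
  extraArgs:
    # Fail /healthz first while still serving in-flight requests;
    # set higher than the LB drain timeout below
    shutdown-grace-period: 90
  service:
    # 'Cluster' adds a hop but avoids dropped requests when a pod's
    # nodeport is shut off at termination under 'Local'
    externalTrafficPolicy: Cluster
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
      service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
```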

OK, so part of the problem I had with figuring all this out was that the settings didn’t seem to work the same everywhere. After much confusion, it turned out that some additional networking was causing some LB setups to appear more reliable than others. I also had trouble in a few cases getting proxy protocol working; it’s not yet clear why, but on some clusters I had to disable and re-enable it a few times before things worked. I found some references online to issues with health check ports not working in SSL mode, so if you get stuck it could help to set the health check port manually to 80, but I’m not sure.
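If you do go the proxy protocol route to recover client IPs, the usual pairing is the Classic LB annotation plus the matching ConfigMap key. Both are real options, though whether the health check port annotation is honored depends on your cloud provider / LB controller version, so treat that last line as an assumption:

```yaml
controller:
  config:
    # Tell nginx to expect the proxy protocol header from the LB
    use-proxy-protocol: "true"
  service:
    annotations:
      # Enable proxy protocol on all backend ports of the Classic LB
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
      # Assumption: pin the health check to the plain HTTP port if
      # checks fail in SSL mode; support varies by provider version
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "80"
```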

It’s not yet ‘perfect’, as there are still unresolved issues around what happens when a node gets scaled in. There isn’t a mechanism in place to tell the LB to drain and remove the node before it gets killed, since under the ‘Cluster’ policy every node reports a good health check no matter what. If I can find a way for the LB to know that a node has been selected for removal, that might get fixed.

I’ll call this a ‘partial win’ for now.