Post-mortem - Pods stuck in ContainerCreating status

Over the past days we have received multiple customer reports of new Pods getting stuck in the ContainerCreating status, mostly affecting CronJobs. Even if you were not directly affected, you may have noticed messages in your channels that we were rolling out a lot of node updates without prior notice. This short post-mortem explains the cause, the steps we took to mitigate it, and what we should have done better.

Timeline

  • Evening of 29/04: Our on-call engineer received an alert: one of our customer clusters could no longer launch CronJobs, because pods were stuck in ContainerCreating status with the event message: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown. We traced it back to a kernel bug in the upstream EKS AMI. In response to this escalation, we temporarily increased the net.core.bpf_jit_limit sysctl value on the affected nodes (a sketch of this mitigation follows the timeline).
  • 2/05: Our platform team dug deeper into the issue. So far only one cluster was affected by the bug, and we decided to revert to an older AMI until AWS published an upstream fix. We replaced all running EKS nodes of the affected cluster. Meanwhile we wrote a quick (manual) check to verify all other environments; it detected no problems.
  • 4/05: AWS released a new upstream EKS AMI containing the fix. Our CI built our custom AMI on top of it, and we rolled out cluster changes to update all AutoScalingGroup launch templates with the new AMI. After scanning all environments again and finding no signs of the issue, we decided it was not necessary to force a node rollout on all environments, so as not to disturb our customers. Instead we would rely on nodes gradually being replaced through autoscaling, fault recovery, …
  • 25/05: Several environments suddenly started showing the initial problem and related issues, with direct impact on customer production environments. We quickly realized what was going on, started force-updating the affected nodes, and rolled out all remaining changes to all environments (affected and non-affected).
  • 26/05: Most nodes across all environments have been updated. A few stragglers remain whose replacement is known to disrupt customer workloads; we will get in touch to schedule a rollout for those node pools.
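For reference, the temporary mitigation from the first night boiled down to raising the seccomp/BPF JIT memory limit on the affected nodes. Below is a minimal sketch of such a check-and-bump, assuming root access on the node (for example via a privileged DaemonSet); the target value is illustrative, not the exact value we used.

    # Sketch: inspect and raise net.core.bpf_jit_limit on a node.
    # Assumption: runs as root directly on the node (or via a privileged
    # DaemonSet); the target value below is illustrative only.

    BPF_JIT_LIMIT = "/proc/sys/net/core/bpf_jit_limit"

    def read_limit() -> int:
        with open(BPF_JIT_LIMIT) as f:
            return int(f.read().strip())

    def raise_limit(new_value: int) -> None:
        current = read_limit()
        if current >= new_value:
            print(f"bpf_jit_limit already {current}, nothing to do")
            return
        with open(BPF_JIT_LIMIT, "w") as f:
            f.write(str(new_value))
        print(f"bpf_jit_limit raised from {current} to {read_limit()}")

    if __name__ == "__main__":
        # Illustrative target: 4x a commonly seen default of 264241152 bytes.
        raise_limit(4 * 264241152)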

What went wrong & lessons learned

We incorrectly assessed the impact of the bug: we only prepared the fixes and assumed that initially unaffected environments would not start failing before all patches were actually rolled out. By not immediately forcing a rolling upgrade across all running EKS nodes, we set ourselves up for sudden failures weeks later, after we already had a fix ready.
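Forcing such a rolling upgrade after a launch template update does not have to be heavy-handed: an AutoScalingGroup instance refresh replaces nodes batch by batch. A rough sketch using boto3 follows; the group name and health settings are illustrative, not our actual configuration.

    # Sketch: force a rolling node replacement for one EKS node group by
    # starting an ASG instance refresh. Group name and preferences are
    # illustrative; tune MinHealthyPercentage to the disruption the
    # workloads can tolerate.
    import boto3

    def refresh_node_group(asg_name: str) -> str:
        autoscaling = boto3.client("autoscaling")
        response = autoscaling.start_instance_refresh(
            AutoScalingGroupName=asg_name,
            Strategy="Rolling",
            Preferences={
                "MinHealthyPercentage": 90,  # keep most capacity during the roll
                "InstanceWarmup": 300,       # seconds before the next batch starts
            },
        )
        return response["InstanceRefreshId"]

    if __name__ == "__main__":
        # Hypothetical node group name.
        refresh_id = refresh_node_group("eks-workers-production")
        print(f"Started instance refresh {refresh_id}")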

In addition, we relied on a one-off check to detect the problem across all environments, instead of creating a proper monitoring alert for it.
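A recurring check that pages when pods linger in ContainerCreating would have caught the recurrence on 25/05 much earlier. Here is a sketch of such a check using the official kubernetes Python client, assuming a kubeconfig per environment; the 10-minute threshold is illustrative.

    # Sketch: flag pods stuck in ContainerCreating longer than a threshold.
    # Assumes the official kubernetes Python client and a kubeconfig for
    # the target cluster; the threshold is illustrative.
    from datetime import datetime, timezone, timedelta

    from kubernetes import client, config

    STUCK_AFTER = timedelta(minutes=10)

    def find_stuck_pods() -> list[str]:
        config.load_kube_config()
        v1 = client.CoreV1Api()
        now = datetime.now(timezone.utc)
        stuck = []
        for pod in v1.list_pod_for_all_namespaces().items:
            if pod.status.phase != "Pending":
                continue
            if now - pod.metadata.creation_timestamp < STUCK_AFTER:
                continue
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if waiting and waiting.reason == "ContainerCreating":
                    stuck.append(f"{pod.metadata.namespace}/{pod.metadata.name}")
                    break
        return stuck

    if __name__ == "__main__":
        for name in find_stuck_pods():
            print(f"stuck in ContainerCreating: {name}")

In practice we would wire this into our existing alerting rather than run it as a standalone script, for example as a Prometheus alert on the container waiting-reason metrics that kube-state-metrics exposes.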

Assumption is the root of all evil: with critical bugs like these, we need to actively roll out our fixes as soon as possible.