We use the AWS-published EKS AMI (Amazon Machine Image), which is in turn based on Amazon Linux 2, as the base for the custom image we build for our managed Kubernetes clusters. Our CI system monitors the AMIs that AWS publishes and automatically builds a new version of our custom base image, which is then rolled out to customer clusters as part of our regular update cycle.
The latest version of the AWS AMI was published on October 27th, but it was later recalled without any comment from AWS. Unfortunately, our CI picked up this AMI before it was recalled and built a new version of our EKS base image, which was then rolled out to a handful of clusters, triggered by other, unrelated updates.
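For context, the monitoring step mentioned above can be as simple as polling the public SSM parameter AWS maintains for the EKS-optimized AMI and comparing it against the image the previous build used. The sketch below is illustrative only, not our actual pipeline: the Kubernetes version, region, and "last built from" AMI ID are placeholders.

```python
# Minimal sketch: check whether AWS has published a new EKS-optimized AMI
# by reading the public SSM parameter for the recommended image.
import boto3

K8S_VERSION = "1.23"            # placeholder: the Kubernetes version you track
REGION = "eu-central-1"         # placeholder region
PARAM = f"/aws/service/eks/optimized-ami/{K8S_VERSION}/amazon-linux-2/recommended/image_id"

ssm = boto3.client("ssm", region_name=REGION)
latest_ami = ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]

last_built_from = "ami-0123456789abcdef0"  # placeholder: AMI used for the previous build

if latest_ami != last_built_from:
    print(f"New EKS AMI detected: {latest_ami} -> trigger base image build")
else:
    print("No new EKS AMI published")
```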
It turns out that the recalled AMI was affected by a Linux kernel regression, reported in this GitHub issue. We assume that’s what forced AWS to recall the published AMI. In short, the regression affects the kubelet’s ability to report the node status; as a consequence, nodes based on the affected AMI can flap between Ready and NotReady within short periods of time. This shouldn’t directly affect running workloads, but it can impact new Pods being scheduled onto affected nodes. We’ve seen this behavior in the clusters where the offending AMI was rolled out.
We’ve already built a new base image based on the last-known-working EKS AMI (v20220926) and rolled it out to the affected clusters. You might have seen some nodes replaced in your clusters due to that. No further action is required from you, but as usual, if you notice any issues let us know.
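If you want to double-check which image your worker instances are currently running, a sketch along these lines can help. The cluster-name tag, cluster name, and region below are assumptions and may not match your setup.

```python
# Sketch: print the AMI each running worker instance of a cluster uses,
# to confirm nodes were replaced with the fixed image.
import boto3

CLUSTER_NAME = "my-cluster"     # placeholder
ec2 = boto3.client("ec2", region_name="eu-central-1")  # placeholder region

pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag:eks:cluster-name", "Values": [CLUSTER_NAME]},  # assumed tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], instance["ImageId"])
```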