For a long time now, we’ve been using the AWS Node Termination Handler to catch Spot Instance interruption notices, which lets Kubernetes respond by draining the affected Spot nodes before they are terminated. You may have noticed this behavior via the “:construction: Instance interruption” notices in the Slack alerts.
We are now updating the Node Termination Handler (NTH) configuration so that it also reacts to many more events, such as EC2 maintenance events and Auto Scaling group (ASG) lifecycle events.
More concretely, this means that the cluster will try to react to any node shutdown event (not just Spot terminations) by gracefully shutting down and migrating the applications running on that node. As a result, you might notice an increased number of “:construction: Instance interruption” messages in your alerting channel.
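To make the difference concrete, here is a minimal sketch of the behavior change. It is not NTH's actual code: the event labels, the `Node` class and `handle_event` are hypothetical names made up for this illustration, and draining is reduced to "mark the node unschedulable and evict its pods".

```python
from dataclasses import dataclass

# Hypothetical event labels for illustration; not NTH's real identifiers.
SPOT_INTERRUPTION = "spot-interruption"
SCHEDULED_MAINTENANCE = "ec2-scheduled-maintenance"
ASG_LIFECYCLE_TERMINATION = "asg-lifecycle-termination"

# Before the change only Spot interruptions triggered a drain.
OLD_HANDLED = frozenset({SPOT_INTERRUPTION})
# After the change the other shutdown events trigger a drain as well.
NEW_HANDLED = frozenset(
    {SPOT_INTERRUPTION, SCHEDULED_MAINTENANCE, ASG_LIFECYCLE_TERMINATION}
)


@dataclass
class Node:
    name: str
    schedulable: bool = True
    pods: tuple = ()


def handle_event(node, event_kind, handled):
    """Cordon and drain the node if the event kind is handled.

    Returns the list of evicted pods (empty if the event is ignored).
    """
    if event_kind not in handled:
        return []
    node.schedulable = False  # cordon: no new pods land on this node
    evicted = list(node.pods)  # drain: pods are rescheduled elsewhere
    node.pods = ()
    return evicted
```

With `OLD_HANDLED`, a maintenance event on a node leaves its pods in place until the instance goes away; with `NEW_HANDLED`, the same event cordons the node and evicts the pods first, which is also what produces the extra Slack notices.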
This change has already been rolled out to all non-production clusters. Production is scheduled for the week of 2/05.
In addition, we are experimenting with (optional) ASG Instance Refresh on some non-production clusters. This feature automatically and gradually replaces nodes whenever the ASG launch template gets updated, for example when a new AMI is published. Together with the NTH updates, this should give us a stable yet automated rollout for many of our clusters, reducing the need for manually scheduled rolling upgrades.
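As a rough mental model of "gradually", the sketch below plans replacement batches for nodes that are still on an old launch template version. It is a simplification under stated assumptions, not how AWS implements Instance Refresh: `plan_refresh_batches` and its parameters are hypothetical, and the rule "always replace at least one node per batch" is our own choice for the example.

```python
import math


def plan_refresh_batches(instances, current_version, min_healthy_percent=90):
    """Group outdated nodes into replacement batches.

    instances: dict mapping node name -> launch template version.
    Returns a list of batches, each a list of node names to replace together.
    """
    outdated = sorted(
        name for name, version in instances.items() if version != current_version
    )
    total = len(instances)
    # Number of nodes that must stay in service while a batch is replaced.
    min_healthy = math.ceil(total * min_healthy_percent / 100)
    # Assumption for this sketch: always make progress with at least one node.
    batch_size = max(1, total - min_healthy)
    return [
        outdated[i:i + batch_size] for i in range(0, len(outdated), batch_size)
    ]
```

For example, with four nodes of which three are outdated and a minimum healthy percentage of 50, the plan is two batches (two nodes, then one). Each replaced node is drained through the NTH path described above before it terminates.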