Guaranteed QoS for all critical system and infrastructure Pods

We’ve seen on multiple occasions that, when a cluster is starved of resources, the kubelet starts evicting critical infrastructure Pods. This can lead to significant downtime and disruption.

We have added a custom PriorityClass named infra-cluster-critical and assigned it to the cluster-autoscaler, Prometheus, Alertmanager and nginx-ingress.
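For illustration, a PriorityClass like this could look roughly as follows. This is a minimal sketch, not our actual manifest: only the name infra-cluster-critical comes from this post, while the value, preemption policy and description are assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: infra-cluster-critical
# Assumed value: user-defined classes may go up to 1000000000,
# higher values are reserved for the built-in system classes.
value: 900000000
# Not the default for Pods that do not name a class explicitly.
globalDefault: false
# Assumed: allow these Pods to preempt lower-priority Pods when scheduling.
preemptionPolicy: PreemptLowerPriority
description: "Critical infrastructure Pods (illustrative description)."
```

A workload opts in by setting priorityClassName: infra-cluster-critical in its Pod spec (spec.template.spec for a Deployment).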

For Alertmanager and the cluster-autoscaler we also set CPU limits, so those deployments now run with Guaranteed QoS. We have not configured CPU limits for nginx-ingress and Prometheus yet, to avoid accidental throttling.
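As a reminder of what Guaranteed QoS requires: Kubernetes only assigns that class when every container in the Pod has CPU and memory requests that are equal to its limits. A container resources block along these lines would qualify. The numbers are placeholders chosen for the example, not the values we actually run with.

```yaml
# Hypothetical resources for a single container; values are placeholders.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    # Limits must equal requests, for both CPU and memory,
    # on every container in the Pod to get Guaranteed QoS.
    cpu: 100m
    memory: 256Mi
```

The class Kubernetes assigned to a running Pod can be read from its status.qosClass field, for example with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'.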
