Guaranteed QoS for all critical system and infrastructure Pods

We’ve seen on multiple occasions that, when a cluster is starved of resources, the kubelet starts evicting critical infrastructure Pods. This can lead to significant downtime and disruption.

We have added a custom PriorityClass named infra-cluster-critical and assigned it to the cluster-autoscaler, Prometheus, Alertmanager and nginx-ingress.
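
As a rough illustration, a PriorityClass like this could be defined and referenced as sketched below. Only the name infra-cluster-critical comes from this note. The value, the description and the Pod template fragment are assumptions for the example and may differ from what is actually deployed.

```yaml
# Sketch only: the value and description are assumed, not the deployed settings.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: infra-cluster-critical
# Assumed value. User-defined classes must stay at or below 1000000000;
# higher values are reserved for the built-in system-* classes.
value: 900000000
globalDefault: false
description: "Critical infrastructure Pods (assumed wording)."
---
# Fragment of a Deployment's Pod template showing how a workload opts in.
spec:
  template:
    spec:
      priorityClassName: infra-cluster-critical
```

A higher priority makes these Pods less likely to be picked first when the kubelet ranks Pods for eviction, and lets the scheduler preempt lower-priority Pods to make room for them.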

For Alertmanager and the cluster-autoscaler we also set CPU limits, so those deployments now have the Guaranteed QoS class (which requires requests equal to limits for both CPU and memory in every container). We have not configured CPU limits for nginx-ingress and Prometheus yet, to avoid accidental throttling.
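
For reference, a Pod is only classified as Guaranteed when every container sets CPU and memory requests equal to its limits. The fragment below is a minimal sketch of such a container spec. The container name and all resource values are placeholders, not the settings actually used for Alertmanager or the cluster-autoscaler.

```yaml
# Sketch only: container name and resource values are placeholders.
spec:
  template:
    spec:
      priorityClassName: infra-cluster-critical
      containers:
        - name: alertmanager
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 100m      # equal to the CPU request
              memory: 128Mi  # equal to the memory request
```

The class that Kubernetes assigned can be checked with `kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'`. A Pod with a memory limit but no CPU limit, as is still the case for nginx-ingress and Prometheus, is reported as Burstable instead.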