Monitoring for Grafana Loki in case of discarded logs

During a routine monitoring review, we noticed that some Promtail pods were using significantly more CPU than the generic CPU request. This pointed us to two issues:

  1. Although we use the Vertical Pod Autoscaler (VPA), CPU requests for Promtail pods were not being updated. This is because the DaemonSet is evaluated as a whole and the lower CPU bound is used for all pods. We have stopped using the VPA for this component and instead allow manual overrides of these resources where needed.

  2. Grafana Loki, our default logging solution, was rate limiting some of those Promtail pods, which can cause dropped logs. We have added a new LokiDiscardingSamples alert to notify us when Loki is dropping requests. We have also increased the default values for the Loki ingestion limits, which remain overridable.
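For illustration, an alert of this kind could be expressed as a Prometheus alerting rule along the following lines. This is a minimal sketch, not our actual rule: only the alert name comes from the text above, while the group name, expression, threshold, duration, severity label and annotation text are assumptions. It relies on the `loki_discarded_samples_total` counter that Loki exposes.

```yaml
groups:
  - name: loki.discarded-samples   # hypothetical group name
    rules:
      - alert: LokiDiscardingSamples
        # Fire when Loki has discarded samples for any tenant/reason pair.
        # The 5m window and the "> 0" threshold are assumed values.
        expr: sum by (tenant, reason) (rate(loki_discarded_samples_total[5m])) > 0
        for: 15m                   # assumed duration, to avoid alerting on short bursts
        labels:
          severity: warning        # assumed severity
        annotations:
          description: 'Loki is discarding samples for tenant {{ $labels.tenant }} (reason: {{ $labels.reason }}).'
```

The ingestion limits themselves live in the `limits_config` section of the Loki configuration. The values below are placeholders chosen for the example, not the defaults we rolled out:

```yaml
limits_config:
  ingestion_rate_mb: 8         # placeholder: per-tenant ingestion rate limit, in MB per second
  ingestion_burst_size_mb: 12  # placeholder: per-tenant allowed burst size, in MB
```

Grouping the alert expression by `reason` makes it easier to tell rate limiting apart from other causes of discarded samples.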

These changes have been rolled out to all environments.