Updated Prometheus & Grafana monitoring stack

Our cluster monitoring stack is based on the prometheus-operator developed by CoreOS. More concretely, we used kube-prometheus as the starting point for a complete setup.

This project has seen numerous changes and improvements, such as the introduction of the PrometheusRule CRD, which replaces plain ConfigMaps for defining alerting rules.
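
As a minimal sketch of what such a rule object can look like: the names, labels and threshold below are hypothetical, and the labels your Prometheus instance selects rules by depend on its ruleSelector configuration.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules            # hypothetical name
  labels:
    prometheus: my-prometheus   # assumption: must match the ruleSelector of your Prometheus
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0   # hypothetical job label
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: my-app has been unreachable for 5 minutes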

This set of charts has now been migrated to the upstream Kubernetes charts under the prometheus-operator name, so we’ve rebased our monitoring wrapper onto the upstream chart. This is quite a big update of our monitoring stack, and it brings some nice features:

  • Newest versions of Prometheus (2.4.1) and Grafana (5.3.4)
  • Updated and completely new Grafana dashboards
  • You can now add your own Grafana dashboard via a simple ConfigMap
  • ServiceMonitors no longer need to be defined in the same namespace where Prometheus is running
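
To illustrate the last point, a ServiceMonitor can now live next to the application it scrapes. The sketch below assumes a hypothetical Service labelled app: my-app in a my-team namespace that exposes a port named metrics; whether additional labels are needed depends on how the Prometheus serviceMonitorSelector is configured.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # hypothetical name
  namespace: my-team            # the application's namespace, not the Prometheus one
spec:
  selector:
    matchLabels:
      app: my-app               # assumption: label on the Service to scrape
  endpoints:
    - port: metrics             # assumption: named port on the Service
      path: /metrics
      interval: 30s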

Defining Grafana dashboards

You can set up your own Grafana dashboards by adding a ConfigMap that carries the grafana_dashboard label. Be sure to set Prometheus as the datasource in the dashboard.

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  labels:
    grafana_dashboard: "true"   # label values must be strings, so quote it
data:
  my-dashboard.json: |
    { <Grafana dashboard json> }

Migration process

We’re replacing the current monitoring stack on our staging clusters starting today. Production clusters will follow next week.

Unfortunately, due to several changes, historic data will be lost.

For now we’re focusing on cluster monitoring, so you don’t need to take any action regarding application monitoring yet.

What’s next

Currently we’re running a separate Prometheus that you can use for your application-specific metrics & alerting. Since the original requirement that led us down this path is no longer valid, we’ll move all monitoring to the Prometheus instance running in the infrastructure namespace too.

The goal is to make this as transparent as possible, and it’s likely you won’t need to take extra steps.

In any case, we’ll keep you posted via this Changelog once this change rolls out in the next week(s).