Our cluster monitoring stack is based on the prometheus-operator developed by the people at CoreOS. More concretely, we used kube-prometheus as a starting point for a complete setup.
This project has seen numerous changes and improvements, such as the introduction of the PrometheusRule CRD as a replacement for plain ConfigMaps for defining alerting rules.
This set of charts has now been migrated to the upstream Kubernetes charts under the prometheus-operator name, and we’ve rebased our monitoring wrapper onto the upstream one. This is quite a big update of our monitoring stack, bringing some nice features:
- Newest versions of Prometheus (2.4.1) and Grafana (5.3.4)
- Updated and completely new Grafana dashboards
- You can now add your own Grafana dashboard via a simple ConfigMap
- ServiceMonitors no longer need to be defined in the same namespace where Prometheus is running
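To illustrate the last point, here is a minimal sketch of a ServiceMonitor that lives in an application namespace instead of the one where Prometheus runs. The names my-app and my-app-namespace, the app label and the metrics port name are hypothetical placeholders. Depending on how the Prometheus instance selects ServiceMonitors, extra labels on the ServiceMonitor itself may also be required.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # hypothetical name
  namespace: my-app-namespace   # hypothetical: any namespace, not only the Prometheus one
spec:
  selector:
    matchLabels:
      app: my-app               # assumption: the target Service carries this label
  endpoints:
    - port: metrics             # assumption: the Service exposes a port named "metrics"
      interval: 30s
```

The selector matches labels on the Service, not on the Pods, and the port is referenced by its name as defined in that Service.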
Defining Grafana dashboards
You can set up your own Grafana dashboards by adding a ConfigMap that has the grafana_dashboard label defined. Be sure to have Prometheus set as the datasource in the dashboard.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  labels:
    # Label values must be strings, so the value is quoted
    grafana_dashboard: "true"
data:
  my-dashboard.json: |
    { <Grafana dashboard json> }
Migration process
We’re replacing the current monitoring stack on our staging clusters starting today. Production clusters will follow next week.
Unfortunately, due to several changes, historic data will be lost.
For now we’re focussing on the cluster-monitoring, so you don’t have to take any actions concerning the application-monitoring yet.
What’s next
Currently we’re running a separate Prometheus which you can use for your application-specific metrics & alerting. Since the original requirement for which we chose this path is no longer valid, we’ll move all monitoring to the Prometheus instance running in the infrastructure namespace too.
The goal is to make this as transparent as possible, and it’s likely you won’t need to take extra steps.
In any case, we’ll keep you posted via this Changelog once this change rolls out in the next week(s).