Support for cronjob monitoring

Update (18-03-2019): It turns out the default alerts already cover every kind of CronJob failure. Each of the following alerts handles a different failure case:

  • KubeJobCompletion: Warning alert after 1 hour if any Job doesn’t succeed or doesn’t run at all.

  • KubeJobFailed: Warning alert after 1 hour if any Job failed.

  • KubeCronJobRunning: Warning alert after 1 hour if a CronJob keeps on running.
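To make the three conditions concrete, here is a simplified Python model of them. It is only a sketch: the real alerts are Prometheus alerting rules, not this code, and JobSnapshot and alerts_for are hypothetical names. The "doesn’t run at all" case is left out because there is no Job object to inspect when nothing ran.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# All three alerts wait 1 hour before firing.
GRACE = timedelta(hours=1)


@dataclass
class JobSnapshot:
    """Hypothetical, simplified view of one Job's state (not a Kubernetes API type)."""
    created_at: datetime                     # when the Job was created
    succeeded: bool                          # True once the Job completed successfully
    failed_since: Optional[datetime] = None  # when a failure was first observed, if any
    owner_cronjob: Optional[str] = None      # name of the owning CronJob, if any


def alerts_for(job: JobSnapshot, now: datetime) -> List[str]:
    """Names of the alerts this Job would trigger under the simplified model."""
    alerts = []
    age = now - job.created_at
    # KubeJobCompletion: not succeeded within the grace period.
    if not job.succeeded and age >= GRACE:
        alerts.append("KubeJobCompletion")
    # KubeJobFailed: a failure that has persisted for the grace period.
    if job.failed_since is not None and now - job.failed_since >= GRACE:
        alerts.append("KubeJobFailed")
    # KubeCronJobRunning: a CronJob's Job that is still running after the grace period.
    still_running = not job.succeeded and job.failed_since is None
    if job.owner_cronjob is not None and still_running and age >= GRACE:
        alerts.append("KubeCronJobRunning")
    return alerts
```

In this model, a CronJob’s Job that has been running for two hours triggers both KubeJobCompletion and KubeCronJobRunning, so one stuck run can show up under more than one alert name.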

You can check these alerts in your cluster’s Prometheus (https://prometheus.your.cluster.domain). If you previously added the cronjob label to monitor your CronJobs, you can now remove it.
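Besides the Prometheus web UI, the same alerts can be listed through the Prometheus HTTP API (GET /api/v1/alerts), which returns both pending and firing alerts. Below is a minimal standard-library Python sketch. It assumes the endpoint is reachable without extra authentication; PROMETHEUS_URL is a placeholder for your cluster’s Prometheus, and fetch_alerts and job_alerts are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Placeholder: point this at your cluster's Prometheus.
PROMETHEUS_URL = "https://prometheus.your.cluster.domain"
JOB_ALERTS = {"KubeJobCompletion", "KubeJobFailed", "KubeCronJobRunning"}


def fetch_alerts(base_url: str = PROMETHEUS_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """GET /api/v1/alerts and return the decoded JSON response."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as resp:
        return json.load(resp)


def job_alerts(payload: Dict[str, Any], state: str = "firing") -> List[Dict[str, str]]:
    """Keep only the Job/CronJob alerts in the given state and return their label sets."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        a["labels"]
        for a in alerts
        if a.get("labels", {}).get("alertname") in JOB_ALERTS and a.get("state") == state
    ]
```

Usage would be job_alerts(fetch_alerts()). Keeping the parsing separate from the HTTP call makes the filter easy to test against a saved response.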


Original post (16-01-2019):

We have updated the staging clusters to support monitoring of CronJobs. The monitoring triggers a critical alert when the last run of a CronJob did not succeed.

To enable monitoring for a CronJob, add a label called cronjob to it, with any value you like.
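The opt-in rule can be modelled as follows: only CronJobs carrying the cronjob label are considered, the label’s value is ignored, and a critical alert is raised when the most recent run did not succeed. This is a hypothetical Python sketch of that logic (Run, CronJobView and should_alert are illustrative names), not the alerting rule actually deployed on the clusters.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Run:
    """One finished run (Job) of a CronJob; hypothetical simplified type."""
    started_at: datetime
    succeeded: bool


@dataclass
class CronJobView:
    """Hypothetical view of a CronJob: its labels plus its finished runs."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    runs: List[Run] = field(default_factory=list)


def should_alert(cj: CronJobView) -> bool:
    """Critical alert if the CronJob opted in and its most recent run did not succeed."""
    if "cronjob" not in cj.labels:  # opt-in: only the label's presence matters
        return False
    if not cj.runs:
        return False  # nothing has run yet, so there is no last run to judge
    last = max(cj.runs, key=lambda r: r.started_at)
    return not last.succeeded
```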

Example usage:

```yaml
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: -daily-cronjob
  labels:
    app: 
    cronjob:   # any value enables monitoring
  # spec (schedule, jobTemplate, ...) omitted
```
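"Any value" still has to be a valid Kubernetes label value: at most 63 characters, and either empty or beginning and ending with an alphanumeric character, with dashes, underscores, dots and alphanumerics in between. Here is a small Python check for a manifest that has already been parsed into a dict (for example with a YAML library). has_monitoring_label is a hypothetical helper, and it treats a missing or non-string value (such as a YAML null) as not opted in.

```python
import re

# Kubernetes label values: at most 63 characters; empty, or starting and ending
# with an alphanumeric character, with '-', '_', '.' and alphanumerics in between.
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    return len(value) <= 63 and _LABEL_VALUE.fullmatch(value) is not None


def has_monitoring_label(manifest: dict) -> bool:
    """True if a parsed CronJob manifest carries a valid 'cronjob' label."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    value = labels.get("cronjob")
    return isinstance(value, str) and is_valid_label_value(value)
```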

If we don’t uncover any issues in the staging clusters over the next few days, we’ll roll out the upgrade to all production clusters next week.