Update (18-03-2019): We found that the default alerts already cover all cases of cronjob failure. The following alerts cover the different failure cases:
- KubeJobCompletion: Warning alert after 1 hour if any Job doesn’t succeed or doesn’t run at all.
- KubeJobFailed: Warning alert after 1 hour if any Job failed.
- KubeCronJobRunning: Warning alert after 1 hour if a CronJob keeps on running.
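For reference, alerts of this kind are usually defined as Prometheus alerting rules over kube-state-metrics data. The sketch below shows roughly what such rules can look like. It is an approximation and not a copy of the rules deployed in our clusters: the group name is hypothetical, the expressions assume the standard kube-state-metrics metric names, and the rules in your cluster's Prometheus may use different selectors or thresholds.

```yaml
# Approximate sketch only; check the actual rules in your cluster's Prometheus.
groups:
  - name: kubernetes-jobs   # hypothetical group name
    rules:
      # Fires when a Job has fewer successful completions than requested for 1 hour.
      - alert: KubeJobCompletion
        expr: kube_job_spec_completions - kube_job_status_succeeded > 0
        for: 1h
        labels:
          severity: warning
      # Fires when a Job has reported failed pods for 1 hour.
      - alert: KubeJobFailed
        expr: kube_job_status_failed > 0
        for: 1h
        labels:
          severity: warning
      # Fires when a CronJob's next scheduled run is more than an hour overdue for 1 hour.
      - alert: KubeCronJobRunning
        expr: time() - kube_cronjob_next_schedule_time > 3600
        for: 1h
        labels:
          severity: warning
```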
You can check these alerts in your cluster's Prometheus (https://prometheus.your.cluster.domain). If you previously added the cronjob label to monitor your cronjobs, you can now remove it.
Original post (16-01-2019):
We have updated the staging clusters to support monitoring for cronjobs. The monitoring triggers a critical alert when the last run of a cronjob did not succeed.
To enable the monitoring for your cronjobs, you just need to set a label called cronjob, with any value you like.
Example usage:
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: -daily-cronjob
  labels:
    app:
    cronjob:
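The fragment above shows only the metadata. As a rough illustration of where the label sits in a complete manifest, here is a minimal sketch. Everything specific in it is a hypothetical placeholder and not taken from our clusters: the name myapp, the label value, the schedule, and the container. The apiVersion is kept as in the fragment above and may need adjusting to whatever batch API version your cluster serves.

```yaml
# Hypothetical example: names, label values, schedule, and image are placeholders.
apiVersion: batch/v2alpha1   # kept from the fragment above; adjust to your cluster's API version
kind: CronJob
metadata:
  name: myapp-daily-cronjob
  labels:
    app: myapp
    cronjob: "true"          # any value works; only the presence of the label matters
spec:
  schedule: "0 3 * * *"      # once a day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: daily-task
              image: busybox
              command: ["sh", "-c", "echo running daily task"]
```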
If we don’t uncover any issues in the staging clusters during the next few days, we’ll roll out the upgrade to all production clusters next week.