We monitor KubeJobFailed via Prometheus using the expression:
kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0
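For context, this is roughly how that expression could sit inside a Prometheus alerting rule; the group name, "for" duration, and annotations below are assumptions for illustration, not our actual rule:

groups:
  - name: kubernetes-jobs            # assumed group name
    rules:
      - alert: KubeJobFailed
        expr: kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0
        for: 5m                      # assumed pending duration
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} in {{ $labels.namespace }} has failed"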
Right now we receive multiple alerts for the same job (the alert doesn't get resolved between notifications).
We would like to get an alert only the first time the job fails.
Is this related to the Prometheus expression, or should I edit the Alertmanager YAML?
This is how it is set in the YAML:
- match:
    alertname: 'KubeJobFailed'
  repeat_interval: 1h
  receiver: "slack-k8s-dev"
  continue: true
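For reference, a minimal sketch of how that entry nests under the top-level route in alertmanager.yml; the default receiver name and the grouping values at the top level are assumptions, only the KubeJobFailed sub-route is taken from our config:

route:
  receiver: "default"                 # assumed fallback receiver
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  routes:
    - match:
        alertname: 'KubeJobFailed'
      repeat_interval: 1h
      receiver: "slack-k8s-dev"
      continue: true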
P.S. - We do not want to delete the jobs.
I've tried removing repeat_interval, but then it falls back to the default interval.
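As background: when a sub-route has no repeat_interval of its own, it inherits the value from its parent route, and if nothing in the tree sets it, Alertmanager's documented default of 4h applies. A minimal sketch of pinning it explicitly at the top level instead (the 24h value is just an example, not a recommendation):

route:
  receiver: "default"
  repeat_interval: 24h   # inherited by child routes that don't set their own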