
The problem and the question

The question has been asked and answered here: Monitor Kubernetes Cronjob, and there are a couple of great writeups on the web, such as Prometheus: K8s Cronjob alerts. I don't like making a separate thread, but I can't provide feedback on that post, and I was never able to get all of those approaches to make sense. Until yesterday.

Starting down the well-worn path

Using the job success and failure status metrics versus the running ones, I kept running into the same issue of alerts firing forever. Then there is Tristan's writeup, which is really good, but it was based on an older version of Prometheus/Grafana, so there is a lot of label replacing and some left-to-right and right-to-left grouping that (to me) felt fragile. Then, finally, as others have seen, there is often a "many-to-many" error when you look at job failures: the failure metric reports not just on backoff, but also on timeout exceeded and evicted, so a single job can match more than one series.
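To make that many-to-many part concrete, here is a sketch of what I mean (the job name in the comments is made up, and the exact reason values depend on your kube-state-metrics version): the failure metric can expose one series per failure reason for the same Job, so a one-to-one join on (namespace, job_name) hits that error unless the reason label is aggregated away first.

# One Job can surface several kube_job_status_failed series, one per reason, e.g.:
#   kube_job_status_failed{job_name="nightly-backup-27421980", reason="BackoffLimitExceeded"}
#   kube_job_status_failed{job_name="nightly-backup-27421980", reason="Evicted"}
#
# Joining these one-to-one on (namespace, job_name) is what triggers the error;
# summing the reason label away first leaves a single series per Job:
sum by (namespace, job_name) (kube_job_status_failed != 0)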

Standing on their shoulders

I took all the well-known pieces and tried simplifying them for myself. I am part of a larger enterprise with a Kubernetes cluster that has many consumers, so I have to be specific about what I'm looking for (hence the aws_region, environment, namespace, and job_name filters in the queries below).

If you want a detailed explanation, look at Tristan's writeup. Basically, this section boils down to:

  • find the "orphan" Jobs by their Kubernetes name
  • associate them with their owning CronJob
  • keep only the current (most recent) Job run
clamp_max(
    max by (namespace, owner_name, job_name) (
        max by (namespace, owner_name, job_name) (
            kube_job_status_start_time{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"}
            * on (job_name) group_left(owner_name)
            max by (namespace, owner_name, job_name) (
                kube_job_owner{owner_kind="CronJob",namespace="NAMESPACE"}
            )
        )
        ==
        on (namespace, owner_name) group_left
        max by (namespace, owner_name) (
            kube_job_status_start_time{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"}
            *
            on (job_name) group_left(owner_name)
            max by (namespace, owner_name, job_name) (
                kube_job_owner{owner_kind="CronJob",namespace="NAMESPACE"}
            )
        )
    ),
    1
)

I am still too shaky on "on" and "group_*" to try to explain them with any authority.
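For what it's worth, here is the rough mental model I have been using for the join above (a simplified sketch; the example label values in the comments are made up):

# LEFT:  kube_job_status_start_time{namespace="apps", job_name="nightly-backup-27421980"}
# RIGHT: kube_job_owner{namespace="apps", job_name="nightly-backup-27421980",
#                       owner_kind="CronJob", owner_name="nightly-backup"}
#
# "on (job_name)" restricts the matching to the job_name label only, and
# "group_left(owner_name)" marks the left side as the "many" side while copying
# the owner_name label from the matched right-hand series onto the result.
kube_job_status_start_time
* on (job_name) group_left(owner_name)
max by (namespace, owner_name, job_name) (kube_job_owner{owner_kind="CronJob"})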

Then the final part is "are we failing?"

* on (namespace, job_name) group_left()
sum by (namespace, job_name)(kube_job_status_failed{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"} != 0)

This means "am I failing at all?" I don't care why it failed; any non-zero failure count means I want to be alerted. You can and should create specific rules for the various "reason" values in order to have more granularity (see the sketch below); me, I'm just happy to get this working.
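As a rough starting point for that, here is a minimal sketch of a per-reason rule (the reason value shown is just one example, the exact values available depend on your kube-state-metrics version, and the label filters are the same placeholders used throughout this post):

# Alert separately on one specific failure reason, e.g. the backoff limit:
sum by (namespace, job_name, reason) (
    kube_job_status_failed{aws_region="REGION", environment="ENV", namespace="NAMESPACE", job_name=~"JOB-WILDCA.*", reason="BackoffLimitExceeded"}
) > 0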

Complete alert

sum(
clamp_max(
    max by (namespace, owner_name, job_name) (
        max by (namespace, owner_name, job_name) (
            kube_job_status_start_time{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"}
            * on (job_name) group_left(owner_name)
            max by (namespace, owner_name, job_name) (
                kube_job_owner{owner_kind="CronJob",namespace="NAMESPACE"}
            )
        )
        ==
        on (namespace, owner_name) group_left
        max by (namespace, owner_name) (
            kube_job_status_start_time{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"}
            *
            on (job_name) group_left(owner_name)
            max by (namespace, owner_name, job_name) (
                kube_job_owner{owner_kind="CronJob",namespace="NAMESPACE"}
            )
        )
    ),
    1
)
* on (namespace, job_name) group_left()
sum by (namespace, job_name)(kube_job_status_failed{aws_region="REGION",environment="ENV",namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"} != 0)
) > 1
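For reference, here is a rough sketch of how this could be wired into a Prometheus alerting rule. The group name, alert name, "for" duration, and labels are all hypothetical, and the expr is abbreviated to just the failure half for readability; in practice you would paste the complete expression above in its place.

groups:
  - name: cronjob-failures                # hypothetical group name
    rules:
      - alert: CronJobCurrentRunFailed    # hypothetical alert name
        # Abbreviated for readability; replace with the complete alert expression above.
        expr: |
          sum(
            sum by (namespace, job_name) (
              kube_job_status_failed{aws_region="REGION", environment="ENV", namespace="NAMESPACE", job_name=~"JOB-WILDCA.*"} != 0
            )
          ) > 1
        for: 0m                           # fire immediately; tune to your schedules
        labels:
          severity: warning               # made up; match your own routing
        annotations:
          summary: "The most recent run of a CronJob-owned Job in NAMESPACE has failed"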

Conclusion

Now the kicker: make sure you wrap that entire mess in sum() > 1 for the alert. And if you run this without having actually had a "current job run failed" condition occur, nothing will be returned. I tested this in the Explore tab and rolled back through 35 days' worth of cron runs, which surfaced the failure events I expected. I am not entirely happy with summing the many-to-many "reason" series; I'm still researching a better solution and/or waiting on feedback from others.

Thanks for reading. (The edit was to change the final max/clamp_max to "sum".)

  • Stack Overflow is not a blog. Consider rewriting this into an actual question and answer. Or maybe adding one more answer to linked in the beginning question. – markalex Aug 03 '23 at 18:13
  • i can't add an answer cause even though i've used this resource for a long time, its the first time i've wanted or felt able to contribute so i "don't have enough karma to comment" on the other threads. and i follow what you are saying about this not being a blog. thank you for that feedback – user2654268 Aug 03 '23 at 18:21
  • There is no requirements on reputation to add an answer (on Stack Overflow). I was proposing you to add an answer [here](https://stackoverflow.com/q/47343842/2654268) – markalex Aug 03 '23 at 18:44
  • i can not answer the question "Highly active question. Earn 10 reputation (not counting the association bonus) in order to answer this question. The reputation requirement helps protect this question from spam and non-answer activity." – user2654268 Aug 03 '23 at 19:14
  • Huh. Wasn't aware of such limitations. Then rework this into a proper Q&A pair. Or else it probably will be closed and later automatically deleted, and I think it should be saved. – markalex Aug 03 '23 at 19:34
