
I have an existing CronJob for which I have set up alerts. The alerts work when the CronJob is scheduled in the 'monitoring' namespace. I am using the Kube-Prometheus stack, which is also deployed in the 'monitoring' namespace. When I schedule the CronJob in another namespace, named 'labs', the alert does not fire and I do not receive any email.

This is my configuration when I schedule the CronJob in the 'monitoring' namespace:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: monitoring
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"  
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: job-container
            image: busybox
            imagePullPolicy: IfNotPresent
            command: ["/bin/sh", "-c"]
            args:
            - exit 1  
          restartPolicy: Never
          terminationGracePeriodSeconds: 10  
      backoffLimit: 0

The PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")
          
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max,1)
              * ON(job) GROUP_LEFT()
              label_replace(
                label_replace(
                  (kube_job_status_failed != 0),
                  "job", "$1", "job_name", "(.+)"),
                "cronjob", "$1", "owner_name", "(.+)")
        
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'

And the AlertmanagerConfig:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
        - name: job
          value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: 'user@mycompany.com'
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false

This configuration is working and the alerts are firing and being delivered.

However, when I schedule the CronJob in the 'labs' namespace with the following configuration, the alerts are not firing and they are not being delivered:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: labs
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"  
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: job-container
            image: busybox
            imagePullPolicy: IfNotPresent
            command: ["/bin/sh", "-c"]
            args:
            - exit 1  
          restartPolicy: Never
          terminationGracePeriodSeconds: 10  
      backoffLimit: 0

        

And the resulting PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")
          
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max,1)
              * ON(job) GROUP_LEFT()
              label_replace(
                label_replace(
                  (kube_job_status_failed != 0),
                  "job", "$1", "job_name", "(.+)"),
                "cronjob", "$1", "owner_name", "(.+)")
        
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: labs
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'

For now it appears as if the PrometheusRule only discovers targets in the 'monitoring' namespace.

I have checked both the Prometheus and Alertmanager logs, but there are no errors in either.

What am I missing?

Golide
  • I think that to receive any help you'll need to find out where your problem happens: is it the discovery part or your rules? Check whether your Prometheus scraped data from the second target (the one in the 'labs' namespace) at all. If not, the rules aren't the problem and you need to look into the target discovery config. If yes, debug your rules. – markalex Aug 25 '23 at 12:29
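
A concrete version of that check, assuming kube-state-metrics is what feeds these metrics and that the recording rules above are loaded, would be to run queries along these lines in the Prometheus UI (the metric and rule names are the ones already used in the rules):

# raw kube-state-metrics series for Jobs in the 'labs' namespace
kube_job_status_failed{namespace="labs"}
kube_job_owner{namespace="labs", owner_name="failing-cronjob"}

# output of the recording rule for that namespace
job:kube_job_status_failed:sum{namespace="labs"}

If the first two queries return nothing, the scrape/discovery side is the problem; if they return data but the recording-rule query is empty, the rules themselves need debugging.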

1 Answer


It seems like you have set up your PrometheusRule to discover targets only in the 'monitoring' namespace, which is why the alerts are not firing when the CronJob is scheduled in the 'labs' namespace. To make your alerts work across different namespaces, you need to make your configuration namespace-agnostic. You can achieve this by using ServiceMonitors and configuring them to select the appropriate targets, as sketched below.
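
As a rough sketch of what such a namespace-agnostic ServiceMonitor could look like, assuming the metrics come from kube-state-metrics and that your Prometheus picks up ServiceMonitors carrying the release: prometheus label; the Service label and port name here are assumptions and must match your actual kube-state-metrics Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics-any-namespace
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    any: true                  # match Services in every namespace, not just 'monitoring'
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics   # assumed label on the Service
  endpoints:
    - port: http-metrics       # assumed name of the metrics port

On the PrometheusRule side, namespace-agnostic would presumably also mean dropping the hardcoded namespace: monitoring / namespace: labs label from the CronJobStatusFailed alert, so that the namespace label coming from the metrics themselves is kept instead of being overwritten per rule.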

Please reply if you need further help or more examples.

If you manage to do it on your own, happy coding.

pwoltschk