
I have a deployment whose pod continuously goes into the CrashLoopBackOff state. I have set up an alert for this event, but the alert is not firing on the configured receiver. The alert only fires on the default AlertManager receiver that comes configured with every AlertManager deployment.

The AlertManager deployment is part of a bitnami/kube-prometheus stack deployment.

I have added the custom receiver to which the alert should also be sent. This receiver is essentially an email recipient and it has the following definition:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    alertmanagerConfig: email
    release: prometheus
spec:
  route:
    receiver: 'email-receiver'
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 5m
    matchers:
      - name: job
        value: pod-restarts
  receivers:
  - name: 'email-receiver'
    emailConfigs:
      - to: 'etshuma@mycompany.com'
        sendResolved: true
        from: 'ops@mycompany.com'
        smarthost: 'mail2.mycompany.com:25'
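As a sanity check, the configuration that the operator generates from all AlertmanagerConfig objects can be decoded from the generated secret. This is only a sketch: the secret name is an assumption based on the pod name used later in this post, and older operator versions store a plain alertmanager.yaml key instead of the gzipped one.

# Decode the operator-generated Alertmanager configuration (secret name assumed
# from the alertmanager-prometheus-kube-prometheus-alertmanager-0 pod name).
kubectl -n monitoring get secret \
  alertmanager-prometheus-kube-prometheus-alertmanager-generated \
  -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip
# The output should contain a receiver named
# monitoring/pod-restarts-receiver/email-receiver and a matching sub-route.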

This alert is triggered by the PrometheusRule below:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts-alert
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: api
      rules:
        - alert: PodRestartsAlert 
          expr: sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5
          for: 1m
          labels:
            severity: critical
            job: pod-restarts
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has more than 5 restarts"
            description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has experienced more than 5 restarts."

I have extracted the definition of the default receiver from the AlertManager pod as follows:

kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- sh
cd conf
cat config.yaml

And config.yaml has the following definition:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

I have also extracted, from the AlertManager UI, the global configuration of the deployment. As expected, it shows that the new alert receiver has been added:

global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
  webex_api_url: https://webexapis.com/v1/messages
route:
  receiver: "null"
  group_by:
  - job
  continue: false
  routes:
  - receiver: monitoring/pod-restarts-receiver/email-receiver
    group_by:
    - alertname
    match:
      job: pod-restarts
    matchers:
    - namespace="monitoring"
    continue: true
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 5m
  - receiver: "null"
    match:
      alertname: Watchdog
    continue: false
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
- name: "null"
- name: monitoring/pod-restarts-receiver/email-receiver
  email_configs:
  - send_resolved: true
    to: etshuma@mycompany.com
    from: ops@mycompany.com
    hello: localhost
    smarthost: mail2.mycompany.com:25
    headers:
      From: ops@mycompany.com
      Subject: '{{ template "email.default.subject" . }}'
      To: etshuma@mycompany.com
    html: '{{ template "email.default.html" . }}'
    require_tls: true
templates: []
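One way to check which receiver an alert with my labels would actually be routed to is amtool's route test, run against the rendered configuration inside the pod. This is a sketch; it assumes amtool is bundled in the Alertmanager image and that the operator writes the rendered config to the path below:

# Test routing for the labels the PodRestartsAlert actually carries.
kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- \
  amtool config routes test \
  --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml \
  alertname=PodRestartsAlert job=pod-restarts namespace=labs severity=critical
# If this prints "null" instead of monitoring/pod-restarts-receiver/email-receiver,
# the labels do not satisfy the sub-route's matchers (note the injected
# namespace="monitoring" matcher in the config above).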

EDIT

I have a number of questions about the global config for AlertManager:

  1. Strangely enough, in the global configuration my receivers are "null". (Why?)
  2. The topmost (global) section of the config doesn't have any mail settings. (Could this be an issue?)
  3. I'm not sure whether the mail settings defined at the AlertmanagerConfig level work, or how to update the global config file (it's accessible from the pod only). I have looked at the values.yaml file used to spin up the deployment and it doesn't have any options for smarthost or any other mail settings.
  4. There is an additional matcher in the global config file, namespace="monitoring". Do I need to add a similar namespace label in the PrometheusRule?
  5. Does it mean the AlertmanagerConfig has to be in the same namespace as the PrometheusRule and the target pod?

The AlertmanagerConfig also doesn't visualize anything at https://prometheus.io/webtools/alerting/routing-tree-editor/.

What exactly am I missing ?

Golide
  • This is an ongoing issue in which Alertmanager is not sending alerts to webhook endpoints. You can refer to the [Github link](https://github.com/prometheus/alertmanager/issues/2404) for more information. Also, Alertmanager filters alerts based on their labels. Make sure that the labels on the alerts match the labels defined in the Alertmanager configuration file.
    – Srividya Jul 13 '23 at 18:52
  • @Srividya I have amended the labels so that they match. The issue "seems" related to my receiver. It appears as null in the AlertManager global config file ---> route: receiver: "null". The default config also has an additional matcher for namespace --> matchers: - namespace="monitoring". I have since added this namespace label but there is no change. Strangely enough, I can't even visualize the config at https://prometheus.io/webtools/alerting/routing-tree-editor/ – Golide Jul 14 '23 at 08:57

1 Answer


The issue was caused by a TLS verification failure. After checking the logs, this is what I found:

kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 --since=10m
    STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:40.660Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="monitoring/pod-restarts-receiver/email/email[0]: notify retry canceled after 13 attempts: send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:40.707Z caller=notify.go:732 level=warn component=dispatcher receiver=monitoring/pod-restarts-receiver/email integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="send STARTTLS command: x509: certificate signed by unknown authority"
    ts=2023-07-23T11:18:41.380Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
    ts=2023-07-23T11:18:41.390Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml

The AlertmanagerConfig needs to be updated with the requireTLS flag set to false:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: email
    routes:
      - matchers:
        - name: job
          value: pod-restarts
        receiver: email
  receivers:
    - name: email
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
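
After applying the updated manifest, the Alertmanager logs should confirm that the STARTTLS errors stop once the operator regenerates the config and the reloader picks it up. A quick sketch (the manifest file name below is a placeholder for wherever you saved the resource):

# Apply the fixed AlertmanagerConfig and watch notification attempts for the
# email receiver (pod-restarts-receiver.yaml is a placeholder file name).
kubectl apply -f pod-restarts-receiver.yaml
kubectl -n monitoring logs alertmanager-prometheus-kube-prometheus-alertmanager-0 -f | grep -i email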
Golide