I have a deployment whose pod continuously goes into the CrashLoopBackOff state. I have set up an alert for this event, but the alert is not firing on the configured receiver. The alert only fires on the default receiver that ships with each AlertManager deployment.
The AlertManager deployment is part of a bitnami/kube-prometheus stack deployment.
I have added the custom receiver to which the alert should also be sent. This receiver is essentially an email recipient, and it has the following definition:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pod-restarts-receiver
  namespace: monitoring
  labels:
    alertmanagerConfig: email
    release: prometheus
spec:
  route:
    receiver: 'email-receiver'
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 5m
    matchers:
      - name: job
        value: pod-restarts
  receivers:
    - name: 'email-receiver'
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          sendResolved: true
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
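For completeness, my understanding is that the operator only merges this object into the running configuration if the Alertmanager resource selects it. A sketch of the relevant fields (the resource name and selector values here are assumptions based on my release; check with kubectl -n monitoring get alertmanager -o yaml):

# Sketch only: relevant fields of the Alertmanager resource created by the chart.
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: prometheus-kube-prometheus-alertmanager   # assumed name
  namespace: monitoring
spec:
  # AlertmanagerConfig objects are merged only if they match this selector;
  # my AlertmanagerConfig carries the label alertmanagerConfig: email.
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: email
  # With no namespace selector set, only AlertmanagerConfigs in the
  # Alertmanager's own namespace are picked up.
  # alertmanagerConfigNamespaceSelector: {}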
This alert is triggered by the PrometheusRule below:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts-alert
  namespace: monitoring
  labels:
    app: kube-prometheus-stack
    release: prometheus
spec:
  groups:
    - name: api
      rules:
        - alert: PodRestartsAlert
          expr: sum by (namespace, pod) (kube_pod_container_status_restarts_total{namespace="labs", pod="crash-loop-pod"}) > 5
          for: 1m
          labels:
            severity: critical
            job: pod-restarts
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has more than 5 restarts"
            description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has experienced more than 5 restarts."
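For reference, my understanding is that when this rule fires, the alert carries the labels kept by the expression plus the labels added under the rule, roughly:

# Illustrative label set of the firing alert (not captured output):
alertname: PodRestartsAlert   # added by Prometheus from the rule name
namespace: labs               # kept by sum by (namespace, pod)
pod: crash-loop-pod           # kept by sum by (namespace, pod)
severity: critical            # added under labels: in the rule
job: pod-restarts             # added under labels:; this is what my route matches on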
I have extracted the definition of the default receiver from the AlertManager pod as follows:
kubectl -n monitoring exec -it alertmanager-prometheus-kube-prometheus-alertmanager-0 -- sh
cd conf
cat config.yaml
And config.yaml has the following definition:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
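As an alternative to exec'ing into the pod, I understand the operator also stores the rendered configuration in a generated secret; something like the following should dump it (the secret and key names are assumptions following the operator's alertmanager-<name>-generated convention):

# List the secrets first to confirm the exact name:
#   kubectl -n monitoring get secrets | grep alertmanager
kubectl -n monitoring get secret \
  alertmanager-prometheus-kube-prometheus-alertmanager-generated \
  -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip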
I have also extracted the global configuration of the deployment from the AlertManager UI. As expected, it shows that the new alert receiver has been added:
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
  webex_api_url: https://webexapis.com/v1/messages
route:
  receiver: "null"
  group_by:
    - job
  continue: false
  routes:
    - receiver: monitoring/pod-restarts-receiver/email-receiver
      group_by:
        - alertname
      match:
        job: pod-restarts
      matchers:
        - namespace="monitoring"
      continue: true
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 5m
    - receiver: "null"
      match:
        alertname: Watchdog
      continue: false
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
receivers:
  - name: "null"
  - name: monitoring/pod-restarts-receiver/email-receiver
    email_configs:
      - send_resolved: true
        to: etshuma@mycompany.com
        from: ops@mycompany.com
        hello: localhost
        smarthost: mail2.mycompany.com:25
        headers:
          From: ops@mycompany.com
          Subject: '{{ template "email.default.subject" . }}'
          To: etshuma@mycompany.com
        html: '{{ template "email.default.html" . }}'
        require_tls: true
templates: []
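For what it's worth, amtool (which ships with Alertmanager) can test which receiver a given label set routes to; a sketch, assuming the rendered config above is saved locally as alertmanager.yaml:

# Print the receiver(s) that an alert with these labels would be routed to:
amtool config routes test --config.file=alertmanager.yaml \
  alertname=PodRestartsAlert job=pod-restarts namespace=labs severity=critical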
EDIT
I have a number of questions about the global config for AlertManager:

- Strangely enough, in the global configuration the default route's receiver is "null". Why?
- The topmost (global) section of the config doesn't have any mail settings. Could this be an issue?
- I'm not sure whether the mail settings defined at the AlertmanagerConfig level actually work, or how to update the global config file (it is accessible from inside the pod only). I have looked at the values.yaml file used to spin up the deployment, and it doesn't have any options for smarthost or any other mail settings.
- There is an additional matcher, namespace="monitoring", in the generated route (see the sketch after this list). Do I need to add a similar namespace label in the PrometheusRule? Does this mean the AlertmanagerConfig has to be in the same namespace as the PrometheusRule and the target pod?
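From what I can tell from the prometheus-operator docs, the operator itself adds that namespace matcher to scope each AlertmanagerConfig to its own namespace; on recent operator versions this behaviour appears to be controlled by a field on the Alertmanager resource (a sketch, assuming a version that supports it):

# Sketch: on the Alertmanager resource; type "None" would stop the operator
# from appending namespace="monitoring" to my route (default is "OnNamespace").
spec:
  alertmanagerConfigMatcherStrategy:
    type: None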
The routing tree editor at https://prometheus.io/webtools/alerting/routing-tree-editor/ is also not visualizing anything when I paste in this configuration.
What exactly am I missing?