I'm experiencing a widespread issue with Prometheus to Alertmanager communication. Whenever an Alertmanager pod restarts, the Prometheus server logs a 503 error for that individual pod. The other Alertmanager pods keep receiving alerts until they are restarted as well.
Prometheus Version: 2.42.0
Alertmanager Version: 0.25.0
Istio Version: v1.17
Issue Description
I'm using an Istio mesh to connect Prometheus to Alertmanager. Whenever an Alertmanager pod gets restarted, I get the error below. If I restart the Prometheus server, the error goes away and it is able to establish a new connection to Alertmanager. It looks like Prometheus is caching the old pod IPs and the stale connections are never fully closed.
ts=2023-03-07T21:34:40.312Z caller=scrape.go:1351 level=debug component="scrape manager" scrape_pool=alertmanager target=http://am-0.monitoring.svc.cluster.local:9093/metrics msg="Scrape failed" err="server returned HTTP status 503 Service Unavailable"
alerting configuration:
alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
      replacement: $1
      separator: ;
  alertmanagers:
    - static_configs:
        - targets:
            - am-0.monitoring.svc.cluster.local:9093
            - am-1.monitoring.svc.cluster.local:9093
            - am-2.monitoring.svc.cluster.local:9093
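In case it's relevant, I also considered letting Prometheus re-resolve the Alertmanager endpoints itself instead of pointing at per-pod static targets. This is only a sketch I haven't tried yet; the headless service name alertmanager-headless is an assumption, substitute whatever service fronts the pods:

alerting:
  alertmanagers:
    - dns_sd_configs:
        # Hypothetical headless service selecting the Alertmanager pods.
        # Prometheus re-queries the A records every refresh_interval,
        # so restarted pods should be picked up with their new IPs.
        - names:
            - alertmanager-headless.monitoring.svc.cluster.local
          type: A
          port: 9093
          refresh_interval: 30s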
Can you please help with this?