I have a 4-node Kubernetes cluster set up via kubeadm on local VMs. I am using the following:
- Kubernetes 1.24
- Helm 3.10.0
- kube-prometheus-stack Helm chart 41.7.4 (app version 0.60.1)
When I go into either Prometheus or Alertmanager, there are many alerts that are always firing. Another thing to note is that the Alertmanager "cluster status" is reporting as "disabled"; I am not sure what bearing (if any) that has on this. I have not added any alerts of my own - everything that is firing was presumably deployed by the Helm chart.
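For what it's worth, this is how I can check how many Alertmanager replicas are actually running (a rough sketch - the resource names are my guess based on my release name prometheus-stack in the monitoring namespace; I believe cluster mode only applies when there is more than one replica):
# StatefulSet the operator creates for Alertmanager
kubectl -n monitoring get statefulset alertmanager-prometheus-stack-kube-prom-alertmanager
# Replica count on the Alertmanager custom resource
kubectl -n monitoring get alertmanager prometheus-stack-kube-prom-alertmanager -o yaml | grep -i replicas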
I do not understand why these alerts are firing beyond what I can glean from their names, and it does not seem like a good thing that they are. Either there is something seriously wrong with the cluster, or the alerting configuration in the Helm chart is poorly set up. I'm leaning toward the second case, but I will admit I really don't know.
Here is a listing of the firing alerts, along with label info:
etcdMembersDown
alertname=etcdMembersDown, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
etcdInsufficientMembers
alertname=etcdInsufficientMembers, endpoint=http-metrics, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
TargetDown
alertname=TargetDown, job=kube-scheduler, namespace=kube-system, service=prometheus-stack-kube-prom-kube-scheduler, severity=warning
alertname=TargetDown, job=kube-etcd, namespace=kube-system, service=prometheus-stack-kube-prom-kube-etcd, severity=warning
alertname=TargetDown, job=kube-proxy, namespace=kube-system, service=prometheus-stack-kube-prom-kube-proxy, severity=warning
alertname=TargetDown, job=kube-controller-manager, namespace=kube-system, service=prometheus-stack-kube-prom-kube-controller-manager, severity=warning
KubePodNotReady
alertname=KubePodNotReady, namespace=monitoring, pod=prometheus-stack-grafana-759774797c-r44sb, severity=warning
KubeDeploymentReplicasMismatch
alertname=KubeDeploymentReplicasMismatch, container=kube-state-metrics, deployment=prometheus-stack-grafana, endpoint=http, instance=192.168.42.19:8080, job=kube-state-metrics, namespace=monitoring, pod=prometheus-stack-kube-state-metrics-848f74474d-gp6pw, service=prometheus-stack-kube-state-metrics, severity=warning
KubeControllerManagerDown
alertname=KubeControllerManagerDown, severity=critical
KubeProxyDown
alertname=KubeProxyDown, severity=critical
KubeSchedulerDown
alertname=KubeSchedulerDown, severity=critical
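In case it is useful, this is a rough sketch of what I can run to see which scrape targets are actually down (the service names are my guess based on my release name; adjust as needed):
# List the control-plane Services and Endpoints the chart's ServiceMonitors point at
kubectl -n kube-system get svc,endpoints | grep -E 'kube-(etcd|scheduler|controller-manager|proxy)'
# Port-forward Prometheus and check the Targets page in the UI for the failing jobs
kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-prometheus 9090:9090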
Here is my values.yaml:
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8s: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific

grafana:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.<hidden>
    path: /
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alerts.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific
  config:
    global:
      slack_api_url: '<hidden>'
    route:
      receiver: "slack-default"
      group_by:
        - alertname
        - cluster
        - service
      group_wait: 30s
      group_interval: 5m # 5m
      repeat_interval: 2h # 4h
      routes:
        - receiver: "slack-warn-critical"
          matchers:
            - severity =~ "warning|critical"
          continue: true
    receivers:
      - name: "null"
      - name: "slack-default"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"
      - name: "slack-warn-critical"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"

kubeControllerManager:
  service:
    enabled: true
    ports:
      http: 10257
    targetPorts:
      http: 10257
  serviceMonitor:
    https: true
    insecureSkipVerify: "true"

kubeEtcd:
  serviceMonitor:
    scheme: https
    servername: <do I need it - don't know what this should be>
    cafile: <do I need it - don't know what this should be>
    certFile: <do I need it - don't know what this should be>
    keyFile: <do I need it - don't know what this should be>

kubeProxy:
  serviceMonitor:
    https: true

kubeScheduler:
  service:
    enabled: true
    ports:
      http: 10259
    targetPorts:
      http: 10259
  serviceMonitor:
    https: true
    insecureSkipVerify: "true"
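For the kubeEtcd serviceMonitor fields I left as placeholders, all I know is that kubeadm generated etcd certificates on the control-plane node; I do not know whether or how the chart is supposed to consume them. This is just what I can look at (assuming kubeadm's default PKI path):
# On the control-plane node - see which etcd certificates kubeadm generated
sudo ls -l /etc/kubernetes/pki/etcd/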
Is there something wrong with this configuration? Are there any Kubernetes objects that might be missing or misconfigured? It seems very odd that one could install this Helm chart and experience this many "failures". Is there perhaps a major problem with my cluster? I would think that if something were really wrong with etcd, kube-scheduler, or kube-proxy, I would be experiencing problems everywhere, but I am not.
If there is any other information I can pull from the cluster or related artifacts that might help, let me know and I will include it.
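For example (a rough sketch - this assumes kubeadm's default static pod manifest location and the default kube-proxy ConfigMap name), I could run the following on the control-plane node and post the output:
# See which addresses the control-plane components bind their metrics endpoints to
sudo grep -E 'bind-address|listen-metrics-urls' /etc/kubernetes/manifests/*.yaml
# Same question for kube-proxy, which reads its config from a ConfigMap
kubectl -n kube-system get configmap kube-proxy -o yaml | grep metricsBindAddress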