
I have a 4-node K8s cluster set up via kubeadm on local VMs. I am using the following:

  • Kubernetes 1.24
  • Helm 3.10.0
  • kube-prometheus-stack Helm chart 41.7.4 (app version 0.60.1)

When I go into either Prometheus or Alertmanager, there are many alerts that are constantly firing. Another thing to note is that the Alertmanager "cluster status" is reported as "disabled"; I'm not sure what bearing (if any) that has on this. I have not added any alerts of my own - everything was presumably deployed by the Helm chart.
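
In case it's relevant, I believe the cluster status can also be queried directly from Alertmanager's v2 API. A rough sketch of how I would do that (the Alertmanager service name is my assumption, based on the prefix the other chart services use):

  kubectl -n monitoring port-forward svc/prometheus-stack-kube-prom-alertmanager 9093:9093
  # then, from another shell:
  curl -s http://localhost:9093/api/v2/status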

I do not understand why these alerts are firing beyond what I can glean from their names, but it does not seem like a good thing that they are. Either there is something seriously wrong with the cluster, or the alerting configuration that ships with the Helm chart is misconfigured. I'm leaning toward the latter, but I'll admit I really don't know.

Here is a listing of the firing alerts, along with label info:

etcdMembersDown
    alertname=etcdMembersDown, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
etcdInsufficientMembers
    alertname=etcdInsufficientMembers, endpoint=http-metrics, job=kube-etcd, namespace=kube-system, pod=etcd-gagnon-m1, service=prometheus-stack-kube-prom-kube-etcd, severity=critical
TargetDown
    alertname=TargetDown, job=kube-scheduler, namespace=kube-system, service=prometheus-stack-kube-prom-kube-scheduler, severity=warning
    alertname=TargetDown, job=kube-etcd, namespace=kube-system, service=prometheus-stack-kube-prom-kube-etcd, severity=warning
    alertname=TargetDown, job=kube-proxy, namespace=kube-system, service=prometheus-stack-kube-prom-kube-proxy, severity=warning
    alertname=TargetDown, job=kube-controller-manager, namespace=kube-system, service=prometheus-stack-kube-prom-kube-controller-manager, severity=warning
KubePodNotReady
    alertname=KubePodNotReady, namespace=monitoring, pod=prometheus-stack-grafana-759774797c-r44sb, severity=warning
KubeDeploymentReplicasMismatch
    alertname=KubeDeploymentReplicasMismatch, container=kube-state-metrics, deployment=prometheus-stack-grafana, endpoint=http, instance=192.168.42.19:8080, job=kube-state-metrics, namespace=monitoring, pod=prometheus-stack-kube-state-metrics-848f74474d-gp6pw, service=prometheus-stack-kube-state-metrics, severity=warning
KubeControllerManagerDown
    alertname=KubeControllerManagerDown, severity=critical
KubeProxyDown
    alertname=KubeProxyDown, severity=critical
KubeSchedulerDown
    alertname=KubeSchedulerDown, severity=critical
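
In case it helps with the TargetDown alerts, these are the kinds of objects I can inspect and dump (service names taken from the alert labels above; I don't yet know whether their endpoints are actually populated):

  kubectl -n monitoring get servicemonitors
  kubectl -n kube-system get endpoints prometheus-stack-kube-prom-kube-scheduler
  kubectl -n kube-system get endpoints prometheus-stack-kube-prom-kube-controller-manager
  kubectl -n kube-system get endpoints prometheus-stack-kube-prom-kube-proxy
  kubectl -n kube-system get endpoints prometheus-stack-kube-prom-kube-etcd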

Here is my values.yaml:

defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8s: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific

grafana:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.<hidden>
    path: /
  persistence:
    enabled: true
    size: 10Gi

alertmanager:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - alerts.<hidden>
    paths:
      - /
    pathType: ImplementationSpecific
  config:
    global:
      slack_api_url: '<hidden>'
    route:
      receiver: "slack-default"
      group_by:
        - alertname
        - cluster
        - service
      group_wait: 30s
      group_interval: 5m # 5m
      repeat_interval: 2h # 4h
      routes:
        - receiver: "slack-warn-critical"
          matchers:
            - severity =~ "warning|critical"
          continue: true
    receivers:
      - name: "null"
      - name: "slack-default"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"
      - name: "slack-warn-critical"
        slack_configs:
          - send_resolved: true # false
            channel: "#alerts-test"

  kubeControllerManager:
    service:
      enabled: true
      ports:
        http: 10257
      targetPorts:
        http: 10257
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"

  kubeEtcd:
    serviceMonitor:
      scheme: https
      servername: <do I need it - don't know what this should be>
      cafile: <do I need it - don't know what this should be>
      certFile: <do I need it - don't know what this should be>
      keyFile: <do I need it - don't know what this should be>

  kubeProxy:
    serviceMonitor:
      https: true

  kubeScheduler:
    service:
      enabled: true
      ports:
        http: 10259
      targetPorts:
        http: 10259
    serviceMonitor:
      https: true
      insecureSkipVerify: "true"
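
For completeness, I installed the chart roughly like this (release name prometheus-stack, which matches the service names above; the exact flags are from memory):

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --version 41.7.4 \
    -f values.yaml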

Is there something wrong with this configuration? Are there any Kubernetes objects that might be missing or misconfigured? It seems very odd that one could install this Helm chart and immediately see this many "failures". Or is there, perhaps, a major problem with my cluster? I would think that if there were really something wrong with etcd, the kube-scheduler, or kube-proxy, I would be experiencing problems everywhere, but I am not.

If there is any other information I can pull from the cluster or related artifacts that might help, let me know and I will include it.
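
For example, I can include output from any of the following (pod and deployment names as they currently appear in my cluster):

  kubectl get pods -n kube-system -o wide
  kubectl get pods -n monitoring -o wide
  kubectl -n monitoring describe deployment prometheus-stack-grafana
  kubectl -n monitoring describe pod prometheus-stack-grafana-759774797c-r44sb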
