I am using the common monitoring tools (Prometheus, cAdvisor, Alertmanager). One of my servers fires the ContainerCpuUsage alert every 30 minutes, but I cannot tell which container triggers it. (I suspect it is cAdvisor itself, although its CPU usage is really low.) So my first question is: is there any way, based on the Prometheus rules, to have Alertmanager include the container name in the alert?

(cAdvisor itself uses more CPU than the other containers.)

cadvisor-rule.yaml

  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage (instance {{ $labels.instance }})"
      description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I've tried {{ $labels.name }} and {{ $labels.job }}, but neither works.

Let's say the instance is called A, and an nginx container and a cAdvisor container run on it. The monitoring tools run on another instance. How can I get the container names into the rule labels, or is there another way to do this?

1 Answer

The comments that accompany this cAdvisor alert rule state that cAdvisor itself can sometimes use a lot of CPU:

  # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
  # If you want to exclude it from this alert, exclude the serie having an empty name: container_cpu_usage_seconds_total{name!=""}

In my case, I started the cAdvisor container with --name=cadvisor and used the following rule expression:

expr: (sum(rate(container_cpu_usage_seconds_total{name!="cadvisor"}[3m])) BY (instance, name) * 100) > 80
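
As for getting the container name into the notification: the expression aggregates BY (instance, name), so those two labels are the only ones left on the result. That is why {{ $labels.job }} renders empty, while {{ $labels.name }} should in principle be available, assuming cAdvisor exports a name label for the containers on that host. Below is a minimal sketch of the full rule with the name in the annotations; the group name cadvisor-rules is a placeholder I made up, and the rest is taken from the rule in the question.

```yaml
groups:
  - name: cadvisor-rules  # hypothetical group name, pick your own
    rules:
      - alert: ContainerCpuUsage
        # "name" survives the aggregation because it is listed in BY (...);
        # "job" is aggregated away and is not available in templates.
        expr: (sum(rate(container_cpu_usage_seconds_total{name!="cadvisor"}[3m])) BY (instance, name) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage (instance {{ $labels.instance }}, container {{ $labels.name }})"
          description: "Container {{ $labels.name }} CPU usage is above 80%\n  VALUE = {{ $value }}"
```

If the name still comes out empty, the firing series probably has no name label at all, which is the case the name!="" matcher quoted above is meant to exclude.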