I'm having issues with Prometheus alerting rules. I have various cAdvisor specific alerts set up, for example:

- alert: ContainerCpuUsage
  expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    title: 'Container CPU usage (instance {{ $labels.instance }})'
    description: 'Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

When the condition is met, I can see the alert in the "Alerts" tab in Prometheus. However, some labels are missing, which prevents Alertmanager from sending a notification via Slack. To be specific, I attach a custom "env" label to each target:

 {
  "targets": [
   "localhost:8080"
  ],
  "labels": {
   "job": "cadvisor",
   "env": "production",
   "__metrics_path__": "/metrics"
  }
 }

But when an alert based on cAdvisor metrics fires, its only labels are alertname, instance and severity: there is no job label and no env label. Alerts based on other exporters (e.g. node-exporter) work just fine, and the env label is present on them.

dywan666

1 Answer

This is due to the sum function that you use: it gathers all the matching time series and adds them together, grouping BY (instance, name). Every label that is not in the grouping list is dropped from the result. If you run the same query in Prometheus, you will see that sum leaves only the grouping labels:

{instance="foo", name="bar"}    135.38819037447163

Other aggregation operators like avg, max, min, etc. work in the same fashion. To bring the label back, simply add env to the grouping list: by (instance, name, env). The same applies to job if you need it on the alert.
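As a minimal sketch, the rule from the question could look like this with the extra grouping labels. It assumes the target carries the `env` and `job` labels shown in the question. The group name `cadvisor` is hypothetical and only there to make the fragment a complete rule file:

```yaml
groups:
  - name: cadvisor  # hypothetical group name, not from the question
    rules:
      - alert: ContainerCpuUsage
        # env and job are listed in by (...) so they survive the sum()
        # aggregation and end up on the alert that Alertmanager receives.
        expr: (sum(rate(container_cpu_usage_seconds_total[3m])) by (instance, name, env, job) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          title: 'Container CPU usage (instance {{ $labels.instance }})'
          description: 'Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'
```

Running the `expr` on its own in the Prometheus expression browser should show whether the resulting series now carry `env` and `job` before you rely on them for routing.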

anemyte
  • Thanks! I've modified my query to this: `(sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name,env) * 100) > 80` and it looks like it's working fine. Is this query okay? To be honest, I do not fully understand this: "But this way you'll get CPU utilisation per instance per name per environment." Why is that an issue? – dywan666 Apr 27 '21 at 09:29
  • Suppose you have a container with `env=prod` and another one with `env=dev`, both on a single machine (instance). By running the query you'll get a distinct CPU utilisation for `env=dev` and `env=prod`. Since you made it so that only `env=prod` can trigger an alert, you won't get notified in case `env=dev` took all CPU resources on the machine. In other words, machine CPU utilisation will be split between the various `env` label values. Whether this is a problem depends on how things run in your environment: if there can be no other `env` except `prod` on production machines, then this is okay. – anemyte Apr 27 '21 at 10:31
  • Oh, and one more thing @anemyte: this env label is attached to the specific target (which is cAdvisor) and not to the containers themselves. It would become a problem if I ran two cAdvisor containers with different env label values. At least that's how I understand it. – dywan666 Apr 27 '21 at 16:24
  • @dywan666 if it's explicitly defined in the job configuration for production instances, then I suppose it should be fine. – anemyte Apr 27 '21 at 17:09