1

I'm monitoring container CPU usage with cAdvisor using the following expression in Prometheus:

(sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80

This alert fires constantly for one of my containers because it really is using over 80% of a CPU, but only of a single core. My host has multiple cores, so I would like to divide this percentage by the number of cores. I can see that cAdvisor exports a metric called machine_cpu_cores, which I thought would help, but I can't get it to work. I've tried:

(sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) / sum(machine_cpu_cores) * 100) > 0

Unfortunately, it returns an empty query result. Also, I don't have any CPU limits set on the containers. What am I doing wrong here?

dywan666

2 Answers

2

The following query should return per-container CPU usage as a percentage. It assumes that containers have no CPU limits, so each of them can use all the CPU cores available on its host (node):

100 * (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
    / on(node) group_left()
  machine_cpu_cores
)

It works in the following way:

  1. It calculates the average per-container CPU usage over the last 5 minutes with rate(container_cpu_usage_seconds_total{container!=""}[5m]). The {container!=""} filter is needed to filter out the cgroup hierarchy entries - see this answer for details.

  2. It divides the per-container CPU usage by the number of CPU cores on the corresponding node (also known as the host or instance). See the docs for the on() and group_left() modifiers.

  3. It multiplies the relative per-container CPU usage by 100 to get a percentage in the range [0 .. 100].
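
As a quick sanity check of the arithmetic in these steps, here is a sketch with made-up numbers (they are not taken from the question): a container that keeps roughly 90% of one core busy on a 4-core node. PromQL accepts plain scalar arithmetic, so the expression can be pasted into the expression browser as-is:

```
# Hypothetical values: rate(...) returns 0.9 (CPU cores in use),
# machine_cpu_cores returns 4 for the same node.
100 * (0.9 / 4)
```

This evaluates to about 22.5, so a container that saturates most of a single core stays well below an 80% threshold once the usage is spread over all cores.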

If the query doesn't return results, try substituting node with instance in the query above. The node label is usually used in Kubernetes, while the instance label may be used in other environments where cAdvisor runs.
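
For the setup described in the question, the same idea might look like the sketch below. It assumes that the container series carry a name label (as in the question's own query), that machine_cpu_cores exposes exactly one series per instance, and that both metrics share the same instance label value. None of this is confirmed by the question, so adjust the labels to your data.

```
# Assumption: `name` holds the container name and is empty for raw cgroup entries.
# Assumption: machine_cpu_cores has a single series per `instance`.
100 * (
  sum by (instance, name) (
    rate(container_cpu_usage_seconds_total{name!=""}[3m])
  )
    / on(instance) group_left()
  machine_cpu_cores
) > 80
```

Here on(instance) restricts matching to the instance label, and group_left() lets many container series on the left pair with the single machine_cpu_cores series on the right. A plain division cannot do that once the left side carries an extra name label.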

valyala
0

It can be tricky to understand what PromQL is doing, and one great way to understand and debug queries is PromLens. If you plug in your query there and switch to the "Explain" tab, you can see what's happening: there's a label mismatch, which you can address using the ignoring() keyword. Something like the following should work:

sum by(instance,name) (
  rate(container_cpu_usage_seconds_total[3m])
)
/ ignoring(job)
machine_cpu_cores
Michael Hausenblas
  • I feel a little ashamed that I've only just learned about PromLens. What an awesome tool! However, I've tried modifying my query, but I still can't wrap my head around the labels and why there is a mismatch. My query currently looks like this: `sum by(instance,name) (rate(container_cpu_usage_seconds_total[3m])) * 100 / sum by (instance) (machine_cpu_cores)` In the "Explain" tab I can see that the left labels are instance and name, the right labels are missing, and there is no match. Only one result is displayed; it seems like an aggregation of some sort, but I have no clue – dywan666 Jul 01 '21 at 16:57
  • Did you try my query above? Did you look on the "Explain" tab in PromLens? :) – Michael Hausenblas Jul 01 '21 at 17:01
  • Yes I did, sorry, I was just editing the comment as I'd made a mess of it. I tried your query, but it is not returning what I want. In the demo example it returns 9 rows, and only the first row returns a result. It does divide the left value by the right value, but the result is incorrect: I need these values for every single container running on a node, and this looks like an aggregated result or something, I'm not sure. – dywan666 Jul 01 '21 at 17:07
  • Roger that and yes you'd need to adapt to your data but the key is that with `ignoring` you can essentially tell PromQL which labels to ignore so that it can match left with right in a meaningful way. Not sure about the aggregate but the best lead I got is https://github.com/kubernetes/kube-state-metrics/issues/1101 – Michael Hausenblas Jul 01 '21 at 17:15
  • Yes, but in my case the label from the left is missing on the right, so I can't do a match. `machine_cpu_cores` is just a single metric with a single value, whereas the number of `container_cpu_usage_seconds_total` series depends on the number of containers running. It also has an additional 'name' label, which is the container name. Therefore, no match :( – dywan666 Jul 01 '21 at 17:26