I've been using the excellent atop for reviewing load-test impact in detail, and the distinction between the SystemLevel/CPU metric in the top (system-wide) section and the ProcessLevel/CPU metric in the bottom (per-process) section has me baffled. I'm aware of similar questions, but I haven't found one that explains it in terms I already understand.
1. Is it % of available capacity, or % of used capacity?
The ProcessLevel/CPU metric is described in the manpage as "The occupation percentage of this process related to the available capacity for this resource on system level." Contrast this with:
ProcessLevel/DSK ("The occupation percentage of this process related to the total load that is produced by all processes (i.e. total disk accesses by all processes during the last interval)."), and with
top's apparent "CPU Usage" equivalent ("The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time.")
Both of these seem to describe occupation relative to the used capacity, which is quite different from the available capacity. Assuming the manpage description is right...
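To make the distinction concrete, here is a rough Python sketch of the two calculations as I understand them, sampled straight from /proc. The PID is hypothetical, and the field offsets and the 100 Hz tick-rate assumption are my own working, not taken from atop's source:

#!/usr/bin/env python3
# Compare the two readings of "occupation percentage" over one interval.
# Assumes Linux and USER_HZ = 100 (check with `getconf CLK_TCK`).
import time

PID = 1234        # hypothetical node PID
INTERVAL = 10     # seconds, like atop's sample interval
CLK_TCK = 100

def proc_ticks(pid):
    # utime + stime: fields 14 and 15 of /proc/<pid>/stat (1-based, after the comm field)
    fields = open(f"/proc/{pid}/stat").read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])

def busy_ticks():
    # /proc/stat first line: cpu user nice system idle iowait irq softirq steal ...
    vals = list(map(int, open("/proc/stat").readline().split()[1:]))
    return sum(vals) - vals[3] - vals[4]   # everything except idle and iowait

p0, b0 = proc_ticks(PID), busy_ticks()
time.sleep(INTERVAL)
p1, b1 = proc_ticks(PID), busy_ticks()

dproc = p1 - p0
# Reading 1: % of *available* capacity, where 100% = one CPU flat out for the whole interval
vs_available = 100.0 * dproc / (INTERVAL * CLK_TCK)
# Reading 2: % of *used* capacity, i.e. this process's share of all busy ticks system-wide
vs_used = 100.0 * dproc / max(b1 - b0, 1)
print(f"vs available: {vs_available:.0f}%   vs used: {vs_used:.0f}%")

As defined above, the first number can exceed 100% on a multi-CPU box if the process has more than one thread burning CPU; the second can't, by construction.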
2. What is "available capacity"?
If it really is measuring against "available capacity", what does that mean? The samples below, with ProcessLevel/CPU at 97% while its CPU is 43% idle, seem to show that it can't be tied closely to the SystemLevel/CPU maximum. Could it be taking account of disk or network wait time?
3. How can it be >100%?
Is this just statistical/sampling error? Is it subject to the same "100% = 1 maxed-out CPU" convention as top's %CPU? If so, how can our single-threaded pal node use more than 100% in the final sample below?
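In case it's relevant to the >100% part: the JS side of node is single-threaded, but the process itself carries extra threads (libuv's pool, V8's housekeeping), so if the per-process figure sums CPU time across all threads it could legitimately pass 100%. Here's a rough sketch of how one might check that, again with a hypothetical PID and hand-derived /proc field offsets:

#!/usr/bin/env python3
# Sum utime+stime for every thread of the process, to see whether the
# per-process CPU figure could be a per-thread aggregate.
import glob

PID = 1234       # hypothetical node PID
CLK_TCK = 100    # USER_HZ, assumed 100 Hz

def ticks(stat_path):
    fields = open(stat_path).read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])   # utime + stime

per_thread = {p.split("/")[4]: ticks(p) for p in glob.glob(f"/proc/{PID}/task/*/stat")}
total = sum(per_thread.values())

print(f"{len(per_thread)} threads, {total / CLK_TCK:.1f}s of CPU time accumulated")
for tid, t in sorted(per_thread.items(), key=lambda kv: -kv[1])[:5]:
    print(f"  tid {tid}: {t / CLK_TCK:.1f}s")

If the busiest threads together burn more than one core-second per wall-clock second over atop's interval, a >100% reading would at least be self-consistent.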
Illustrative Samples
By way of illustration (this might open additional node/Xen/AWS cans of worms, in which case sorry; I'd still appreciate SF's wisdom and am happy to spawn other questions)...
The node app I'm testing (with the also-excellent vegeta) handles a particular upload-heavy request type happily on a 4-CPU AWS instance at a rate of 10 req/s. Under this load, node's ProcessLevel/CPU is around 69%:
CPU sys: 73% | user: 155% | irq: 6% | idle: 143% | wait: 21% | steal: 1%
cpu sys: 18% | user: 36% | irq: 0% | idle: 43% | cpu002 w: 3% | steal: 0%
... blah ... CPUNR CPU CMD
2 69% node
(I'm assuming that the cpu002 corresponds to CPUNR=2).
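(For what it's worth, the system-level CPU line sums to roughly 400%, i.e. 73 + 155 + 6 + 143 + 21 + 1 = 399, which fits 4 CPUs at 100% each, while the per-CPU cpu line sums to 100%: 18 + 36 + 0 + 43 + 3 + 0. So I'm reading the top line as "out of 400%" and the cpu002 line as "out of 100%".)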
Under a load of 14 req/s, which the server doesn't handle (35% timeouts), the ProcessLevel/CPU for node is up at 97%:
CPU sys: 59% | user: 142% | irq: 5% | idle: 170% | wait: 22% | steal: 1%
cpu sys: 14% | user: 40% | irq: 0% | idle: 42% | cpu003 w: 3% | steal: 0%
... blah ... CPUNR CPU CMD
3 97% node
So, if ProcessLevel/CPU means % of the available CPU resource, how can node be using 97% of it when its CPU is 43% idle? Or (at the risk of drifting slightly off topic), if ProcessLevel/CPU means % of the used CPU resource, why would this metric track the load maximum so closely when there's plenty of spare CPU and it isn't waiting for disk (is it maxing the network adapter?)?
Finally, for the >100% question, here is the same box getting really hammered at 16 req/s (ProcessLevel/CPU now up to 111% and all requests failing):
CPU sys: 44% | user: 125% | irq: 4% | idle: 203% | wait: 24% | steal: 1%
cpu sys: 12% | user: 38% | irq: 4% | idle: 41% | cpu001 w: 5% | steal: 0%
... blah ... CPUNR CPU CMD
1 111% node
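(If I'm reading it the same way as above: that system-level line shows about 173% busy, 44 + 125 + 4, out of 400%, so node's 111% is most of what's actually being used but nowhere near what should be available.)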
Cheers!