0

That is, if a processor core spends most of its time waiting for data from RAM, or from the L3 cache after a cache miss, but the system is real-time (real-time thread priority) and the thread is pinned (affinity) to the core and runs without thread/context switches, what CPU core load (usage) should be shown on a modern x86_64?

In other words: does the displayed CPU usage decrease only when the core actually enters the idle state?

And if anyone knows: is the behavior in this case different on other processors - ARM, Power[PC], SPARC?

Clarification: CPU usage as shown in the standard Task Manager on Windows.

Alex
  • The question isn't clear - should show where? Which program do you use for monitoring? You can easily detect this case with performance monitors (VTune, perf, or any other profiling tool) – Leeor Nov 14 '13 at 15:00
  • @Leeor Ok. I added clarification. – Alex Nov 14 '13 at 15:09
  • Doesn't it have a graph of physical memory usage (under the performance tab)? That should measure RAM, but L3 access time is of a much lesser magnitude. – Leeor Nov 14 '13 at 15:22
  • 1
    The processor core will eventually stall, it counts as full load. – Hans Passant Nov 14 '13 at 17:38
  • @Hans Passant Thanks! If you or anyone else has a more detailed answer on why this is the case - when the core counts as loaded (executing CPU instructions, waiting on a cache miss, ...) and when it does not (idle) - then please write an answer. – Alex Nov 14 '13 at 18:56
  • @Leeor I am interested in CPU usage because during a cache miss the CPU doesn't execute any instructions - does that mean the CPU is idle? – Alex Nov 14 '13 at 18:58
  • Since most CPUs can perform instructions out of order, it usually means that they have sufficient independent accesses that can be performed in parallel. A single miss doesn't mean the entire CPU stalls, let alone become idle. – Leeor Nov 14 '13 at 21:22
  • @Leeor But what about a large number of misses, when they take 90% of CPU time - will that affect the CPU usage? – Alex Nov 14 '13 at 22:06

2 Answers

2

A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.

This is true across all architectures.

Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".

Morphcore / other designs that switch on the fly between hyperthreading and a more powerful single core could make a difference between a thread that keeps many execution units busy vs. a thread that is blocked on cache misses a lot of the time.

Peter Cordes
  • I think you are confusing what an OS shows as "CPU usage" with lower-level concepts like IPS, throughput, and latency of instructions. –  Jun 25 '15 at 16:50
  • Thank you! I.e. on Intel x86_64, a virtual (logical) or physical (hardware) core that's stalled on a cache miss can show 100% CPU usage in the OS. And is this true for hyperthreading - when the first logical core is waiting for the second core (on the same hardware core), does the first logical core show as busy too? – Alex Jun 25 '15 at 18:02
  • @knm241: When I say "stalled", I mean out-of-order execution is stalled because it's run out of instructions that aren't in the dependency chain of the stalled load. I'm pointing out that IPC / pipeline stalls / code efficiency doesn't have an impact on whether an OS treats a core as "busy". – Peter Cordes Jun 25 '15 at 20:57
  • 1
    @Alex: each hardware thread is independent. With hyperthreading, each physical core has two hardware threads, aka logical cores. There's no such thing as "the first core waiting for the second core". One core might be running code that's waiting for another thread to release a lock, but that's not the same thing. (Usual lock behaviour is to tell the OS we're asleep between checks for a lock being free.) – Peter Cordes Jun 25 '15 at 21:01
  • @Peter Cordes Maybe it's my lack of understanding of Hyper-Threading, and this is another question, but can one logical core be stalled waiting for another logical core to release hardware resources (ALUs, ports, ...), if they both belong to the same physical core? – Alex Jun 25 '15 at 21:44
  • But I still don't see the link with OS usage statistics. OS CPU usage just tells how much time the CPU spends executing user-mode (and, in context, kernel) code per unit time. The OS doesn't measure CPU usage by counting stalls or anything like that; those are architecturally transparent. –  Jun 25 '15 at 22:30
  • 1
    @knm241: I think I see the disconnect between how you were reading what I actually wrote, and what I meant to say. I added a word to my answer: "still counts", to make it clear that the answer to the OP's question is as you say: time spent executing user code is time spent, regardless of how many stalls that code experiences. – Peter Cordes Jun 26 '15 at 00:35
  • 1
    @Alex: Intel CPUs with hyperthreading competitively share most execution resources, or have them partitioned. In the frontend, the decoders / uop cache alternate each cycle between the instruction streams of the two threads, but then the OoO logic just does what it always does, and runs the oldest uops that have their operands ready. See the microarch doc from http://agner.org/optimize/ for more details. I'm not sure whether HT is smart enough to recognize that one thread is stalled (on a cache miss or something, not with a `PAUSE` insn), and use every frontend cycle on the other thread. – Peter Cordes Jun 26 '15 at 00:40
  • I have written an answer to better explain my point to you, would you please comment on it? To me it seems that, as an analogy, the OP is asking if time spent reading documentation for a framework is to be considered work time or free time. It can be both, depending on *where* you are doing it. So it is with stalls: they can be idle or busy time depending on where they happen. –  Jun 26 '15 at 07:00
0

I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated, as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.


User programs relinquish control to the OS very often: when they need input from the user, or when they are done with a signal/message. GUI programs are basically just big loops, and at each iteration control is given to the OS until the next message. When the OS has control, it schedules other threads/tasks, and if no other actions are needed it just enters the idle process (long ago a tight loop, now a sleep state) until the next interrupt. This is the idle time.

Time spent in an ISR processing user input is considered idle time by any OS. A cache miss there would still be considered idle time.

A heavy program takes more time to complete the work for a given message, thereby returning control to the OS, say, 2 times per second instead of 20.
If the OS measures that in the last second it had control for only 20 ms, then the CPU usage is (1000-20)/1000 = 98%.

This has nothing to do with the optimal use of the CPU architecture; as said, stalls can occur in OS code and still be part of the idle-time statistic. CPU utilization at the pipeline level is not what is measured, and it is orthogonal to the OS statistics.

CPU usage is meant to be used by sysadmins; it is a measure of the load you put on a system, not a measure of how efficiently a program's assembly was generated. Sysadmins can't help with that, but measuring how often the OS gets control back (without preempting) is a measure of how much load a program is putting on the system. And sysadmins can definitely terminate heavy programs.

  • 1
    You're correct that optimal use of the pipeline is uncorrelated with OS CPU usage statistics. They are orthogonal, as you say. – Peter Cordes Jun 27 '15 at 16:39
  • 1
    You're wrong that time spent in kernel code (e.g. interrupt service routines) counts as idle. It's not "user" time for that process, but it is "system", not "idle" or i/o wait time. "idle" time is only when the CPU is actually paused waiting for an interrupt, not when it's actually handling the interrupt from the keypress or mouse move. Maybe some of my comments on my answer were talking about time accounting for a single process (i.e. just the "user" time for it.) Something like Windows Task Manager with charts of CPU utilization counts "system" time too, I think. – Peter Cordes Jun 27 '15 at 16:42
  • @PeterCordes ok, got it! Just for completeness, I was thinking of this: when a system timer fires an IRQ and the CPU wakes up, the kernel has to update the measuring algorithm's state so that it now measures system time instead of idle time, and also store the idle time counted. If a cache miss happens before the state is updated, it counts as idle time. Right? Do you happen to have any info on how Linux measures these times? Thanks! –  Jun 27 '15 at 17:52
  • 1
    I haven't looked at Linux's code for this. I hadn't really thought about this before, but yeah, I guess the kernel needs to do CPU-time accounting on every sleep and every wakeup, and that could be delayed by a cache miss. Unless it uses the CPU performance counters to count clock-cycles-unhalted or something, instead of starting and stopping a stopwatch all the time. – Peter Cordes Jun 28 '15 at 15:16
  • 1
    One other thing: servicing interrupts never happens in the same thread as a user process. Processes make system calls, and time spent in kernel code on behalf of a user process goes to that process's "system" CPU time. Interrupt service routines are overall "system" CPU time, but not accounted toward any process. When a process sleeps on input (e.g. `select(2)` / `poll(2)` Unix system calls, or Windows wait-for-next-window-message call), it's not an interrupt service routine that delivers the message directly. An ISR usually does the minimum work possible, and other code queues the message – Peter Cordes Jun 28 '15 at 15:22
  • 1
    I checked on how Linux does load-average accounting: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cpu-load.txt#n24 explains how it works: every timer interrupt, it checks what, if anything, got interrupted, and accounts time accordingly. (This is a LOT more efficient than a `rdtsc` on every context switch! It's more important for CPU-time accounting to be lightweight than for it to be accurate.) At some point it sorts out idle vs. io-wait, I guess. The scheduler may do its own tracking of giving CPU timeslices to threads, but loadavg is separate. – Peter Cordes Jun 29 '15 at 03:27
  • 1
    Thank you @PeterCordes, I appreciate your efforts –  Jun 29 '15 at 09:18