
My PC has a 10th-gen Core i7 vPro (i7-10875H, Comet Lake) with virtualization enabled: 8 cores + 8 virtual cores.

Each physical core hosts a pair of logical processors, so core 1 hosts virtual cores 0 & 1, core 2 hosts virtual cores 2 & 3, and so on. I've noticed that in Task Manager the first item of each pair seems to be the preferred one, judging by its higher usage. I do set affinities manually for certain heavy programs, but I always set these in groups of 4 (0-3, 4-7, 8-11, or 12-15) and never mix logical processors from different groups.
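
(For reference, this is roughly what one of those group-of-4 masks looks like when applied through the Windows API instead of Task Manager; just a minimal sketch, with the mask value 0xF standing in for logical processors 0-3.)

#include <stdio.h>
#include <Windows.h>

int main(void) {
    // Illustration only: restrict the current process to logical processors 0-3.
    // 0xF has bits 0..3 set, i.e. the first "group of 4" described above.
    DWORD_PTR mask = 0xF;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Process affinity set to 0x%llx\n", (unsigned long long)mask);
    return 0;
}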

I'm wondering why this behaviour happens - do the even-numbered cores correspond to the physical cores, which could be slightly faster? If so, would I get slightly better clock speeds without virtualisation when running programs that don't have a high thread count?


AutoBaker
  • All your cores are logical cores. Pairs of them share a physical core, and your OS might prefer one of the two siblings of each pair as a way to minimize having tasks competing for the same physical core. – Peter Cordes Feb 02 '21 at 19:31
  • Your Comet Lake CPU has Turbo Boost Max 3.0, which identifies one or more physical cores as "special", capable of higher clock speeds. https://www.intel.ca/content/www/ca/en/architecture-and-technology/turbo-boost/turbo-boost-max-technology.html This might be part of it for low thread-count workloads, but wouldn't explain preferring one hyper-sibling to another, since that's totally symmetric. – Peter Cordes Feb 02 '21 at 19:37
  • And BTW, this is not a programming question; seems to belong on https://SuperUser.com/. Voted to migrate (not close) since it's clear and on-topic there. – Peter Cordes Feb 02 '21 at 19:38
  • Do you mean to ask if hyperthreading doubles CPU capacity? It doesn't. Two HT threads are on average 30% faster than a single one. So it makes sense to spread the work among physical cores first, and use HT as a last resort. Also if we count from 0, it'll be even-numbered cores. – rustyx Feb 02 '21 at 19:44
  • @rustyx: Does Windows always enumerate cores with hyper-siblings adjacent? (Or maybe you're just talking about the OP's system specifically.) On some motherboards, Linux numbers the cores 0..n-1 for one logical core of each physical core, and then n..2n-1 for the other sibling. So 0 and 4 are siblings on my i7-6700k (4c8t), for example; i.e. "core id" in /proc/cpuinfo reads 0,1,2,3,0,1,2,3 instead of 0,0,1,1,2,2,3,3. But like I said, this is just how the OS chooses to enumerate; I'm not sure that matches any hard-wired numbering in the BIOS or hardware. – Peter Cordes Feb 02 '21 at 20:57
  • @PeterCordes It seems to at least on my system: `Core 0: mask 0x3, Core 1: mask 0xc, Core 2: mask 0x30, Core 3: mask 0xc0`. Unfortunately [godbolt](https://gcc.godbolt.org/z/raWKxa) can't run win32. – rustyx Feb 02 '21 at 23:12
  • @rustyx: Some other people's Linux systems report the cores in the same order as you and the OP, like 0,0, 1,1, etc. So that can happen by chance. Testing multiple systems would be necessary to show that it consistently does that (especially on systems where Linux does enumerate it the way mine does), or finding some documentation. – Peter Cordes Feb 02 '21 at 23:46
  • @rustyx I know that virtualization doesn't increase capacity. I was actually wondering whether it would slow down clock speeds slightly compared to running with no virtualisation. If my programs don't use more than 8 threads, I'm wondering if there's much point having virtualisation enabled. – AutoBaker Feb 03 '21 at 07:42
  • Having virtualization enabled or not has *nothing* to do with the number of threads in your normal or peak workloads. It's totally orthogonal to hyperthreading. For example, a couple logical cores can be running a virtual guest OS (with paging via HW supported nested page tables, with the hardware handling guest page tables and host page tables), while other logical cores are running something else. Code in a guest VM will run slightly slower than in an OS on bare metal because of more expensive page-walks, and occasional VM-exits for some things, but clock frequency is the same. – Peter Cordes Feb 03 '21 at 08:24

2 Answers


In general (for "scheduler theory"):

  • if you care about performance, spread the tasks across physical cores where possible. This prevents a "2 tasks run slower because they're sharing a physical core, while a whole physical core is idle" situation.

  • if you care about power consumption and not performance, make tasks use logical processors in the same physical core where possible. This may allow the OS to put entire cores into a very power-efficient "do nothing" state.

  • if you care about security (and not performance or power consumption), don't let unrelated tasks use logical processors in the same physical core at all (because information, like what kinds of instructions are currently being used, can be "leaked" from one logical processor to another logical processor in the same physical core). Note that it would be fine for related tasks to use logical processors in the same physical core (e.g. 2 threads that belong to the same process and do trust each other, but not threads that belong to different processes that don't trust each other).

Of course a good OS would know the preference for each task (whether it cares about performance, power consumption or security), and would make intelligent decisions to handle a mixture of tasks with different preferences. Sadly there are no good operating systems - most operating systems and APIs were designed in the 1990s or earlier (back when SMP was just starting and all CPUs were identical anyway) and lack the information about tasks that would be necessary to make intelligent decisions; so they assume performance is the only thing that matters for all tasks, leading to the "tasks spread across physical cores where possible, even when it's not ideal" behavior you're seeing.
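
To make the first (performance) point concrete, here is a minimal sketch of doing that spreading by hand with SetThreadAffinityMask, assuming the pairing described in the question (logical processors 2k and 2k+1 are the hyper-siblings of physical core k). It only illustrates the policy; it is not what the Windows scheduler itself does internally.

#include <stdio.h>
#include <Windows.h>

// Dummy worker that burns some CPU so the placement is visible in Task Manager.
static DWORD WINAPI worker(LPVOID arg) {
    (void)arg;
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 500000000ULL; i++) x += i;
    return 0;
}

int main(void) {
    // Assumption: LPs 2k and 2k+1 share physical core k (8 cores / 16 LPs here).
    const int physical_cores = 8;
    HANDLE threads[8];
    for (int core = 0; core < physical_cores; core++) {
        threads[core] = CreateThread(NULL, 0, worker, NULL, CREATE_SUSPENDED, NULL);
        if (!threads[core]) return 1;
        // Pin each thread to the first hyper-sibling of its core: LP 0, 2, 4, ...
        SetThreadAffinityMask(threads[core], (DWORD_PTR)1 << (2 * core));
        ResumeThread(threads[core]);
    }
    WaitForMultipleObjects(physical_cores, threads, TRUE, INFINITE);
    for (int core = 0; core < physical_cores; core++) CloseHandle(threads[core]);
    return 0;
}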

Brendan

My guess is that's due to hyperthreading.

Hyperthreading doesn't double CPU capacity (according to Intel, it adds ~30% on average), so it makes sense to spread the work among physical cores first, and use hyperthreading as a last resort when the overall CPU demand starts exceeding 50%.

Fun fact: on a hyperthreaded system, a reported 50% overall CPU load actually corresponds to roughly ~70% of the machine's real throughput, and the remaining reported 50% only buys you the remaining ~30%. (If the second hyper-thread adds roughly 30%, one busy thread per core already delivers about 1/1.3 ≈ 77% of that core's maximum throughput, which is where that rough 70/30 split comes from.)

If we query the OS to see how logical processors are assigned to cores1, we will see a situation like this:

Core 0: mask 0x3
Core 1: mask 0xc
Core 2: mask 0x30
Core 3: mask 0xc0
. . .

That means logical processors 0 and 1 are on core 0, 2 and 3 on core 1, etc.

You can disable hyperthreading in the BIOS. But since it adds performance, it's a nice-to-have feature; you just need to be careful not to pin work such that it competes for the same physical core.


1 To check core assignment I use a small C program below. The information might also be available via WMIC.

#include <stdio.h>
#include <stdlib.h>
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x601   // GetLogicalProcessorInformationEx needs Windows 7+
#include <Windows.h>

int main(void) {
    DWORD len = 65536;
    char *buf = (char*)malloc(len);
    if (!buf)
        return 1;
    // One record per physical core is returned when filtering on RelationProcessorCore.
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        return GetLastError();
    }
    // Records are variable-length; advance by each record's Size field.
    char *p = buf;
    for (size_t i = 0, n = 0; n < len; i++) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
        if (info->Relationship == RelationProcessorCore) {
            // Each GroupMask has one bit set per logical processor on this core.
            printf("Core %zu:", i);
            for (int j = 0; j < info->Processor.GroupCount; j++)
                printf(" mask 0x%llx", (unsigned long long)info->Processor.GroupMask[j].Mask);
            printf("\n");
        }
        n += info->Size;
        p += info->Size;
    }
    free(buf);
    return 0;
}
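
Building on that program, the "careful pinning" mentioned above can be done without assuming any particular numbering: take the lowest set bit of each core's GroupMask (one logical processor per physical core) and use the union of those bits as an affinity mask. A rough sketch, assuming a single processor group (up to 64 logical processors) and with minimal error handling:

#include <stdio.h>
#include <stdlib.h>
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x601
#include <Windows.h>

int main(void) {
    DWORD len = 65536;
    char *buf = (char*)malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(RelationProcessorCore, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
        return 1;
    DWORD_PTR one_per_core = 0;
    char *p = buf;
    for (DWORD n = 0; n < len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
        if (info->Relationship == RelationProcessorCore) {
            // Keep only the lowest set bit of this core's mask = its first logical processor.
            KAFFINITY m = info->Processor.GroupMask[0].Mask;
            one_per_core |= m & (~m + 1);
        }
        n += info->Size;
        p += info->Size;
    }
    printf("one-LP-per-core mask: 0x%llx\n", (unsigned long long)one_per_core);
    // Illustration: restrict this process so its threads spread across physical cores.
    if (!SetProcessAffinityMask(GetCurrentProcess(), one_per_core))
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
    free(buf);
    return 0;
}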
rustyx
  • Trying to interpret 50% load (with one logical processor of each physical core busy) as "70%" is *very* sketchy. How much extra work each core could be doing really depends on the actual workload. High-throughput (instructions per clock) code, especially tuned for L1 cache footprint, like a BLAS matmul, might see negative scaling with hyperthreading, i.e. better overall throughput with HT disabled, or with one thread per physical core. Very branchy code, like perhaps some compression algorithms, might see better scaling. (OTOH, some high-throughput code like x264/x265 scales some, like 15-20%). – Peter Cordes Feb 03 '21 at 22:00
  • I must admit I've yet to see negative scaling with HT. Just speaking from experience. We've been doing some measurements on various types of (server) workloads on various Xeon and Core i7/i9 generations and the difference between the "first 50%" and "last 50%" of reported vs. actual CPU performance was always between 70/30 and 80/20. Also I've seen my fair share of massive downtime caused by people assuming there's plenty of "spare" capacity on a server "loaded at 50%". – rustyx Feb 03 '21 at 22:27
  • Most workloads do scale positively, but I think it's not rare for carefully tuned HPC code, especially floating-point number crunching. Or maybe used to be more common when memory bandwidth was lower so cache hit rate was even more valuable, and OoO exec window sizes (ROB and RS) were smaller, so partitioning or competitively sharing would also reduce available ILP within each thread. Especially before Haswell when there was only one SIMD `mul` and one `add` unit, vs. two FMA units, so saturating it wasn't as hard. e.g. Zen is very wide and has lots of back-end throughput resources. – Peter Cordes Feb 03 '21 at 22:48