
I was wondering: can two threads belonging to the same program (and therefore using the same PCID) share TLB entries when they are scheduled to run on the same physical CPU?

I already looked into the SDM (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html); page 3115 (TLB and HT) does not mention any sharing mechanism. But another part of the document states that before a TLB entry is used, its PCID value is checked, and if it matches, the entry is used. However, there is also a bit next to the PCID identifier that is set for the current logical thread.

My question: does the PCID value take priority over the CPU-thread bit, or do both values have to match?

  • Good question; you'd hope threads of the same process sharing a physical core could share TLB entries, in TLB levels that are competitively shared, not statically partitioned. But the semantics of `invlpg` might be a problem for allowing that. Or maybe not, since speculative loads into the TLB can happen at any time, and that time could be due to the other logical core's activity. – Peter Cordes May 18 '22 at 15:55
  • From my understanding, this could be possible and would allow some performance benefits. However, it is not stated anywhere. Yes, obviously I was thinking about the shared L1TLB, not about the partitioned ones. Empirically verifying it can be really hard, since there is so much noise and the program has to be large enough to fill the different sets in the cache – Benedict Schlüter May 20 '22 at 07:01
  • Since you mention it, maybe not that hard to test. There are perf counters for L1dTLB misses, so pin two threads to the same physical core, and have them each repeatedly touch a working set a couple entries smaller than the full L1dTLB size. Like `dtlb_load_misses.stlb_hit` plus `dtlb_load_misses.miss_causes_a_walk` – Peter Cordes May 20 '22 at 11:50
  • I conducted some experiments; would you also interpret the results the way I did? – Benedict Schlüter May 20 '22 at 15:37

1 Answer


From my observations, it is not possible (at least for the dTLB), even though it would bring performance benefits.

How I came to that conclusion

As suggested by Peter, I wrote a small program that consists of two worker threads that access the same heap region over and over again.

Compile with -O0 to prevent optimization.

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

int repetitions = 1ll << 20;  // passes over the memory region per thread
uint64_t ptrsize = 1ll << 18; // region size in bytes: 256 KiB = 64 pages
uint64_t main_cpu, co_cpu;

void pin_task_to(int pid, int cpu)
{
    cpu_set_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
        err(1, "affinity");
}
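// With pid == 0, sched_setaffinity() pins the calling thread itself.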
void pin_to(int cpu) { pin_task_to(0, cpu); }


void *foo(void *p)
{
    pin_to(main_cpu);

    int value = 0; // the sum is meaningless; it only forces the loads
    uint8_t *ptr = (uint8_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        for (size_t i = 0; i < ptrsize; i += PAGE_SIZE)
        {
            value += ptr[i];
        }
    }
    volatile int dummy = value;
    pthread_exit(NULL);
}

void *boo(void *p)
{
    pin_to(co_cpu);

    int value = 0; // the sum is meaningless; it only forces the loads
    uint8_t *ptr = (uint8_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        for (size_t i = 0; i < ptrsize; i+=PAGE_SIZE)
        {
            value += ptr[i];
        }
    }
    volatile int dummy = value;
    pthread_exit(NULL);
}

int main(int argc, char **argv)
{
    if (argc < 3){
        fprintf(stderr, "usage: %s <main_cpu> <co_cpu>\n", argv[0]);
        exit(-1);
    }
    main_cpu = strtoul(argv[1], NULL, 16); // note: the CPU numbers are parsed as hex
    co_cpu = strtoul(argv[2], NULL, 16);
    pthread_t id[2];
    void *mptr = malloc(ptrsize);

    pthread_create(&id[0], NULL, foo, mptr);
    pthread_create(&id[1], NULL, boo, mptr);

    pthread_join(id[0], NULL);
    pthread_join(id[1], NULL);
}
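
For reference, I built and ran it like this (the file name main.c is my assumption; the flags that matter are -O0 and -pthread):

$ gcc -O0 -pthread -o main main.c
$ ./main 1 5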

I decided to sum up all the values in the memory region (the value will obviously overflow) to prevent the CPU from applying microarchitectural optimizations.

[The other idea was to simply dereference the memory region byte by byte and load each value into RAX.]
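
For illustration, a minimal sketch of what that alternative inner loop could look like with GCC inline assembly (hypothetical; not the code the measurements below were taken with):

    for (size_t i = 0; i < ptrsize; i += PAGE_SIZE)
    {
        // force a real byte load into RAX without accumulating anything
        asm volatile("movzbq %0, %%rax" : : "m"(ptr[i]) : "rax");
    }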

We iterate over the memory region `repetitions` times to reduce the noise within one run caused by the slightly different startup times of the threads and by other processes and interrupts on the system.

Results

My machine has four physical and eight logical cores. Logical cores x and x+4 are located on the same physical core (verified with lstopo).
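
Besides lstopo, one way to double-check which logical cores are siblings is sysfs (the 1,5 output is what I would expect on this machine):

$ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,5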

CPU: Intel Core i5-8250U

Running on the same logical core

Since the kernel uses PCIDs to identify TLB entries, a context switch to the other thread should not invalidate the TLBs.
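
Whether the hardware supports PCIDs at all can be checked via the CPU flags (a quick sanity check; the kernel additionally has to decide to actually use them):

$ grep -o -m1 '\bpcid\b' /proc/cpuinfo
pcid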

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1
Running on CPU: 1
Running on CPU: 1

 Performance counter stats for './main 1 1':

        12,621,724      dtlb_load_misses.stlb_hit:u #   49.035 M/sec
             1,152      dtlb_load_misses.miss_causes_a_walk:u #    4.475 K/sec
       834,363,092      cycles:u                  #    3.241 GHz
            257.40 msec task-clock:u              #    0.997 CPUs utilized

       0.258177969 seconds time elapsed

       0.258253000 seconds user
       0.000000000 seconds sys

Running on two different physical cores

No TLB sharing or interference whatsoever.

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2
Running on CPU: 1
Running on CPU: 2

 Performance counter stats for './main 1 2':

        11,740,758      dtlb_load_misses.stlb_hit:u #   45.962 M/sec
             1,647      dtlb_load_misses.miss_causes_a_walk:u #    6.448 K/sec
       834,021,644      cycles:u                  #    3.265 GHz
            255.44 msec task-clock:u              #    1.991 CPUs utilized

       0.128304564 seconds time elapsed

       0.255768000 seconds user
       0.000000000 seconds sys

Running on the same physical core

If TLB sharing were possible, I would expect the lowest number of sTLB hits here, together with a low number of dTLB page walks. Instead, we see the highest numbers in both cases.

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5
Running on CPU: 1
Running on CPU: 5

 Performance counter stats for './main 1 5':

       140,040,429      dtlb_load_misses.stlb_hit:u #  291.368 M/sec
           198,827      dtlb_load_misses.miss_causes_a_walk:u #  413.680 K/sec
     1,596,298,827      cycles:u                  #    3.321 GHz
            480.63 msec task-clock:u              #    1.990 CPUs utilized

       0.241509701 seconds time elapsed

       0.480996000 seconds user
       0.000000000 seconds sys

Conclusion

As you can see, we have the most sTLB hits and dTLB page walks when running on the same physical core. Thus, I conclude that there is no sharing mechanism for the same PCID on the same physical core. Running the process on the same logical core or on two different physical cores results in roughly the same number of sTLB hits and misses. This further supports the thesis that entries are shared within one logical core but not across the logical cores of a physical core.

Update

As suggested by Peter, I also used a linked-list approach to prevent THP and prefetching. The modified code and the new data are shown below.

Compile with -O0 to prevent optimization.

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <inttypes.h>
#include <err.h>
#include <sched.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

const int repetitions = 1ll << 20; // list traversals per thread
const uint64_t ptrsize = 1ll << 5; // number of nodes (pages) in the list: 32
uint64_t main_cpu, co_cpu;

void pin_task_to(int pid, int cpu)
{
    cpu_set_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (sched_setaffinity(pid, sizeof(cpu_set_t), &cset))
        err(1, "affinity");
}
void pin_to(int cpu) { pin_task_to(0, cpu); }


void *foo(void *p)
{
    pin_to(main_cpu);

    uint64_t *value;
    uint64_t *ptr = (uint64_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        value = ptr;
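        // pointer-chase: each page-sized node stores the address of the next,
        // so every load depends on the previous one and touches a fresh page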
        for (size_t i = 0; i < ptrsize; i++)
        {
            value = (uint64_t *)*value;
        }
    }
    volatile uint64_t *dummy = value;
    pthread_exit(NULL);
}

void *boo(void *p)
{
    pin_to(co_cpu);

    uint64_t *value;
    uint64_t *ptr = (uint64_t *)p;
    printf("Running on CPU: %d\n", sched_getcpu());
    for (size_t j = 0; j < repetitions; j++)
    {
        value = ptr;
        for (size_t i = 0; i < ptrsize; i++)
        {
            value = (uint64_t *)*value;
        }
    }
    volatile uint64_t *dummy = value;
    pthread_exit(NULL);
}

int main(int argc, char **argv)
{
    if (argc < 3){
        exit(-1);
    }
    srand(time(NULL));

    uint64_t *head, *tail, *tmp_ptr;
    int r;
    head = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (head == MAP_FAILED)
        err(1, "mmap");
    tail = head;
    for (size_t i = 0; i < ptrsize; i++)
    {
        r = (rand() & 0xF) + 1;
        // use a different offset to the next page each time to prevent
        // microarchitectural (next-page) prefetching; the address is only a
        // hint, and the cast makes the offset count in bytes, not uint64_ts
        tmp_ptr = mmap((uint8_t *)tail - r * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (tmp_ptr == MAP_FAILED)
            err(1, "mmap");
        *tail = (uint64_t)tmp_ptr; // each node stores the address of the next
        tail = tmp_ptr;
    }

    printf("%p, %" PRIx64 "\n", (void *)head, *head);
    main_cpu = strtoul(argv[1], NULL, 16);
    co_cpu = strtoul(argv[2], NULL, 16);
    pthread_t id[2];

    pthread_create(&id[0], NULL, foo, head);
    pthread_create(&id[1], NULL, boo, head);

    pthread_join(id[0], NULL);
    pthread_join(id[1], NULL);
}
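
As Peter mentions in the comments below, another way to keep THP from coalescing the pages is to opt out explicitly with madvise. A minimal sketch of a helper one could use instead of the scattered mmap hints (map_no_thp is a hypothetical name; error handling kept minimal):

#include <sys/mman.h>
#include <err.h>

#define PAGE_SIZE 4096

// Map npages of anonymous memory and ask the kernel not to back the
// region with transparent hugepages, so each access stays a 4k TLB entry.
static void *map_no_thp(size_t npages)
{
    void *p = mmap(NULL, npages * PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");
    if (madvise(p, npages * PAGE_SIZE, MADV_NOHUGEPAGE))
        err(1, "madvise");
    return p;
}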

Same Logical Core

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 1
7feac4d90000, 7feac4d5b000
Running on CPU: 1
Running on CPU: 1

 Performance counter stats for './main 1 1':

             3,696      dtlb_load_misses.stlb_hit:u #   11.679 K/sec
               743      dtlb_load_misses.miss_causes_a_walk:u #    2.348 K/sec
       762,856,367      cycles:u                  #    2.410 GHz
            316.48 msec task-clock:u              #    0.998 CPUs utilized

       0.317105072 seconds time elapsed

       0.316859000 seconds user
       0.000000000 seconds sys

Different Physical Cores

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 2
7f59bb395000, 7f59bb34d000
Running on CPU: 1
Running on CPU: 2

 Performance counter stats for './main 1 2':

            15,144      dtlb_load_misses.stlb_hit:u #   49.480 K/sec
               756      dtlb_load_misses.miss_causes_a_walk:u #    2.470 K/sec
       770,800,780      cycles:u                  #    2.518 GHz
            306.06 msec task-clock:u              #    1.982 CPUs utilized

       0.154410840 seconds time elapsed

       0.306345000 seconds user
       0.000000000 seconds sys

Same Physical Core / Different Logical Cores

$ perf stat -e dtlb_load_misses.stlb_hit,dtlb_load_misses.miss_causes_a_walk,cycles,task-clock ./main 1 5
7f7d69e8b000, 7f7d69e56000
Running on CPU: 5
Running on CPU: 1

 Performance counter stats for './main 1 5':

         9,237,992      dtlb_load_misses.stlb_hit:u #   20.554 M/sec
               789      dtlb_load_misses.miss_causes_a_walk:u #    1.755 K/sec
     1,007,185,858      cycles:u                  #    2.241 GHz
            449.45 msec task-clock:u              #    1.989 CPUs utilized

       0.225947522 seconds time elapsed

       0.449813000 seconds user
       0.000000000 seconds sys
  • Yeah, this is probably good. I was thinking pointer chasing through a linked list with 1 node per page (scattered to avoid transparent hugepages from changing things), but this is easier to write and probably strong enough evidence, given the single core (w. context switches) vs. separate cores showing the same dTLB miss count. Surprised you left out events like `cycles` and `task-clock`, though. (Not having touched the memory pages means they all get backed by the same physical page of zeros, so it's just TLB effects, not cache misses, so that's good all else being equal.) – Peter Cordes May 20 '22 at 15:45
  • You might increment the pointer by a full page instead of reading every byte, otherwise TLB prefetch for contiguous access might be hiding some misses, depending where it prefetches into. Also don't forget to assign the sum to somewhere externally visible when you're done, like `volatile int sink = value;`, so you can compile with optimization without having it optimize away. And avoid `%` inside the loop, that will have a huge performance cost, with or without optimization. – Peter Cordes May 20 '22 at 15:49
  • Thanks for the advice. I will update the solution, the results are even stronger now. Theoretically, the compiler could have optimized the `%` into a `&` but apparently, it didn't do it... – Benedict Schlüter May 20 '22 at 16:13
  • Did you compile with optimization disabled (the gcc default)? I assume so, or it would have removed your loops that only update a local `value` that isn't used later, since the array reads aren't `volatile`. Without optimization, it won't inline or do constant propagation across statements. Oh, and you didn't use `const` or `static const` on your global variables, so `main` can't assume their initial values, even if you did enable optimization. – Peter Cordes May 20 '22 at 16:18
  • Also, you mention "sTLB hit/ dTLB miss ratio." But you're not calculating a ratio of those two things, and that wouldn't be useful. (We can see it's nearly 1:1 - almost every dTLB miss results in an sTLB hit, not an sTLB miss that causes a walk.) It's not a "rate" either, it's the total count that we're interested in. (Well, perf does show you a rate of sTLB hits, about 49M/sec.) If you wanted the dTLB miss rate, you'd need to also count `dTLB-loads` (which might be `mem_inst_retired.all_loads`, although that doesn't count mis-speculated load uops executed.) That's constant for your program. – Peter Cordes May 20 '22 at 16:26
  • I'd earlier suggested a linked list as a way to defeat next-page prefetch. Another way is an LCG (Linear Congruential Generator) or other PRNG that generates every offset in the array exactly once before repeating. An LCG is nice because you can choose parameters that give it exactly the cycle length you want, vs. taking the low bits of an xorshift, which could have a longer period, touching some uint64 elements twice before coming back to others. You can create a data dependency on each load by ORing the load result into the PRNG state if you want. (A sketch of this idea follows after the comments.) – Peter Cordes May 21 '22 at 04:34
  • A PRNG for array offsets can't scatter the pages around in non-contiguous virtual addresses like a linked list could, though. So you can't defeat transparent hugepages. Of course you can do that with kernel tuning options (https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html) or `madvise(MADV_NOHUGEPAGE)` – Peter Cordes May 21 '22 at 04:37
  • Thanks for the feedback, I fixed the wrong information. I tried it with a different script (using a linked list that is slightly scattered [should be enough to prevent prefetching, shouldn't it?], checked with strace), and the results are sufficient in my opinion, though I do not have much experience in that direction. The final observation is now even stronger. – Benedict Schlüter May 21 '22 at 10:47
  • Any irregular pattern should prevent prefetching, yes. A bit of scattering also helps, although THP can kick in when some fraction of the 4k pages in a 2M region are in use, like over 90% or 95%, not necessarily every single page. Interesting that now, sharing a logical core leads to way more walks, and you're getting dTLB misses even in the fast cases. So you're now testing for lack of sharing for the sTLB, maybe not the dTLB; both ways you have about the same number of total dTLB misses. – Peter Cordes May 21 '22 at 12:45
  • https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Memory_Hierarchy says Skylake statically partitions all sizes for each of dTLB and sTLB. I can repro your results on my i7-6700k 4c8t, but I'm not sure wikichip is right: changing `boo` to run a `while(1){ asm("pause"); }` or `while(1){}` loop should still hurt the other thread when running on the same phys core, but my testing showed same miss counts for 1 1 vs 1 2 vs. 1 5; All about 33.5M sTLB hits, about half of what we both get when both threads are doing the same thing. (I also didn't `join` that thread before exit) – Peter Cordes May 21 '22 at 13:05
  • Wikichip says the iTLB's 4k entries are supposed to be dynamically partitioned, so a large *code* working set could also be tested, perhaps by putting `mov rax, imm64` / `jmp rax` into `mmap (MAP_EXEC|MAP_WRITE)` allocations. And BTW, your code still optimizes away if compiled with `-O2`. Probably using `volatile uint64_t *` would fix that. – Peter Cordes May 21 '22 at 13:08
  • I also read through the Optimization Manual (https://cdrdv2.intel.com/v1/dl/getContent/671488) and it lists the dTLB as an example of a shared HT resource (Section 2.6.1.3). Based on that information I would say Wikichip is wrong. Yes, agreed, you can reduce the `ptrsize` number to fit the dTLB size, and then the dTLB misses are < 1000 on the same logical core and ~400k on the same physical core. But the misses which cause a walk stay the same. (Used ptrsize 32; i5-8250U) – Benedict Schlüter May 21 '22 at 17:45
  • Why should `while(1){ asm("pause"); }` hurt the other thread? It only consists of two instructions within the iTLB. But you are right, we only tested it for the dTLB, not for the iTLB. It might be interesting to see if there is a sharing mechanism there. But if they had the logic for sharing, Intel would presumably have used it for the dTLB as well. – Benedict Schlüter May 21 '22 at 17:50
  • I also tried it with volatile before the `uint64_t`, but it still gets optimized away. I added a comment that it needs to be compiled with `-O0` – Benedict Schlüter May 21 '22 at 17:51
  • The thing you're dereferencing is `uint64_t *value;`. Changing it to `volatile uint64_t *value;` with any other necessary casting should do the trick. – Peter Cordes May 21 '22 at 18:06
  • A loop running `while(1){ asm("pause"); }` should hurt the other thread on CPUs where dTLB or sTLB is *statically* partitioned so half the TLB is reserved for one logical core, the other half for the other, regardless of actually using it. i.e. any time you're not getting counts for `cpu_clk_unhalted.one_thread_active`, your code is running with a half-sized ROB and any other statically partitioned resources. e.g. https://www.realworldtech.com/sandy-bridge/7/ says the load and store buffers are partitioned between threads on SnB, probably meaning statically (which makes sense). – Peter Cordes May 21 '22 at 18:11
  • Ahh ok, I thought they would be statically partitioned when hyperthreading is enabled, not dynamically depending on the load. Thanks for the clarification. There is just one last thing that I do not understand: when I turn PCID off through the `nopcid` cmdline arg, I would expect running both threads on the same logical CPU to cause many more dTLB walks, but this isn't the case. From that I would conclude that the kernel can keep the TLBs unflushed if the thread which is scheduled next is in the same address space, correct? (even without PCID) – Benedict Schlüter May 21 '22 at 19:13
  • Not flushing the TLB makes sense on context switch between threads of the same process, but it might happen anyway with Spectre / Meltdown mitigation (changing CR3 because the kernel keeps its pages unmapped while user-space is executing). PCIDs make this less bad. Two threads running sequentially on the same logical core only swap with each other every ~10 ms or so (after using up a time slice), so they aren't stepping on each others' toes constantly like with HT, only once every few 10s of millions of clock cycles. It doesn't take too long for the TLB to refill that. – Peter Cordes May 21 '22 at 20:59
  • That makes sense. Thank you very much so far. I still find it a little strange that there are (apparently) no sharing mechanisms implemented, even though Intel holds a patent in that area (https://patents.google.com/patent/US9703566) – Benedict Schlüter May 24 '22 at 11:51
  • I wonder if this could be Linux's fault via Spectre / Meltdown mitigation? Or maybe the fault of a microcode update? I'd expect it would still use the same top-level page table for each thread of a process, with the same PCID, but I'd be curious to rerun this experiment on pre-2018 microcode with a pre-2018 kernel. I don't have an easy way to do that. I have an old Haswell laptop but I lent its RAM to my brother for an even older macbook... – Peter Cordes May 24 '22 at 13:10
  • I conducted all of my tests with `mds=off nospectre_v2 pti=off` but my laptop had microcode updates till last year. I have a Haswell desktop at home as well. I will test it once I am back next week. – Benedict Schlüter May 24 '22 at 13:19
  • On the software side, without being more familiar with the meltdown/spectre mitigation code myself, I wouldn't be *sure* that it didn't still accidentally(?) use PCIDs in a way that prevents sharing between logical cores. If that's something that can be done safely but isn't, that would be nice to get fixed, especially if we can show a benefit on any real hardware. – Peter Cordes May 24 '22 at 13:24
  • True. But after taking a closer look at the kernel code again, PCIDs are only used on a per-CPU basis, and as far as I can tell these are logical CPUs. However, the kernel only makes use of 6 different PCID values (12 with KPTI; the upper PCID bit is used to switch between kernel and userspace mappings). So I would assume there are collisions if we execute it multiple times. – Benedict Schlüter May 25 '22 at 07:48
  • Great discussion. Does enabling the PGE flag of the CR4 register allow sharing entries among sibling hyperthreads? I understand this might be disabled by default due to some TLB sharing concerns, but I just want to know if this achieves the same thing you've discussed here, or if sharing would still be blocked by the PCID thing. @PeterCordes – Mohammad Siavashi May 12 '23 at 10:13
  • It should have been enabled when I conducted the experiments (it's the default for desktop systems, isn't it?). According to Intel, TLB flushes (induced by a CR3 write with the MSB set, or a context switch) do not affect global entries. – Benedict Schlüter May 12 '23 at 11:47
  • @MohammadSiavashi: PCIDs are only a few bits wide and their meaning is private to each *logical* core. (I think I didn't know this a year ago when I was commenting.) So the architectural guarantees they define make it impossible to share them between logical cores, I'm pretty sure. If one logical core still had some hot TLB entries tagged with one PCID when the other core used the same PCID, and even the same CR3 physical address of the top-level page directory (but with different entries), it's required that the new core doesn't use the stale entries. So they can't share. – Peter Cordes May 12 '23 at 16:31
  • @MohammadSiavashi: I suspect even without PCIDs enabled, there might be guarantees about cores being architecturally independent that make it hard or impossible to safely share TLB entries, even if CPU architects had wanted to try. Unless maybe you have mov to CR3 on one core blow away *all* TLB entries, including ones the other core is using (so maybe worse), and/or you tag them with the full CR3 (the physical page-frame number, i.e. somewhere between 40-12 and 52-12 bits), not just one bit for the core ID. If that would even work, it's a lot of extra space in each TLB entry. – Peter Cordes May 12 '23 at 16:34
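
Following up on Peter's LCG suggestion from the comments above, here is a minimal sketch of what such an access pattern could look like (my own illustration, not code any of the measurements above used; with m a power of two, c = 1 odd, and a - 1 = 4 divisible by 4, the Hull-Dobell conditions guarantee that the generator visits every page index exactly once per period):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define NPAGES 64 /* must be a power of two for the mask below */

/* Touch every page exactly once per period, in a fixed pseudo-random
   order, using the LCG idx' = (5*idx + 1) mod NPAGES. This defeats a
   next-page prefetcher, but not THP, since the region stays virtually
   contiguous. */
uint64_t walk_pages(const uint8_t *buf, size_t iters)
{
    uint64_t sum = 0;
    size_t idx = 0;
    for (size_t n = 0; n < iters; n++) {
        idx = (5 * idx + 1) & (NPAGES - 1); /* next page index */
        sum += buf[idx * PAGE_SIZE];        /* one load per page */
    }
    return sum; /* return the sum so the loads can't be optimized away */
}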