
Consider the following loop:

.loop:
    add     rsi, STRIDE         ; advance the pointer by STRIDE bytes
    mov     eax, dword [rsi]    ; 4-byte demand load from the buffer
    dec     ebp                 ; ebp holds the iteration count
    jg      .loop

where STRIDE is some non-negative integer constant and rsi initially holds a pointer to a buffer defined in the bss section. This is the only loop in the code, and the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer are therefore mapped on demand to the same physical page (the shared zero page).
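For context, here is a minimal sketch of the kind of harness that could drive this loop (the actual experiment is written in assembly; the C wrapper, buffer size, and iteration count below are illustrative assumptions). It reserves an untouched buffer in the bss section, runs the strided loads through inline assembly, and reads the minor/major fault counts with getrusage:

#include <stdio.h>
#include <sys/resource.h>

#define STRIDE    64              /* assumed example stride */
#define NUM_ITERS (1ul << 20)     /* assumed iteration count */

/* Uninitialized, so it lands in .bss and is untouched before the loop. */
static char buffer[NUM_ITERS * STRIDE + 4096];

int main(void)
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    char *p = buffer;
    unsigned long n = NUM_ITERS;
    /* The measured loop: one strided 4-byte demand load per iteration. */
    __asm__ volatile(
        "1:\n\t"
        "add  %[stride], %[ptr]\n\t"
        "mov  (%[ptr]), %%eax\n\t"
        "dec  %[cnt]\n\t"
        "jg   1b\n\t"
        : [ptr] "+r"(p), [cnt] "+r"(n)
        : [stride] "i"((long)STRIDE)
        : "eax", "memory", "cc");

    getrusage(RUSAGE_SELF, &after);
    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    return 0;
}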

I've run this code for all possible strides in the range 0-8192. The measured numbers of minor and major page faults are exactly 1 and 0, respectively, per page accessed. I've also measured all of the following performance events on Haswell for all of the strides in that range.

  • DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Misses in all TLB levels that cause a page walk of any page size.
  • DTLB_LOAD_MISSES.WALK_COMPLETED_4K: Completed page walks due to demand load misses that caused 4K page walks in any TLB levels.
  • DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M: Completed page walks due to demand load misses that caused 2M/4M page walks in any TLB levels.
  • DTLB_LOAD_MISSES.WALK_COMPLETED_1G: Completed page walks due to demand load misses that caused 1G page walks in any TLB levels.
  • DTLB_LOAD_MISSES.WALK_COMPLETED: Completed page walks in any TLB level, of any page size, due to demand load misses.
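All of these events were counted in user mode only. As a reference, here is a hedged sketch of how one of them can be counted around the loop with perf_event_open; the raw encoding 0x0108 (event 0x08, umask 0x01) for DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK is my reading of the Haswell event tables and should be verified against perf list or libpfm before relying on it:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a raw PMU counter for this process, counting user mode only. */
static int open_raw_counter(uint64_t raw_config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = raw_config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* user-mode counts only */
    attr.exclude_hv = 1;
    /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags. */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    /* 0x0108 = (umask 0x01 << 8) | event 0x08: assumed encoding for
     * DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK on Haswell. */
    int fd = open_raw_counter(0x0108);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the strided-load loop here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("MISS_CAUSES_A_WALK (user): %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}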

The two hugepage counters (WALK_COMPLETED_2M_4M and WALK_COMPLETED_1G) are both zero for all strides. The other three counters are interesting, as the following graph shows.

(Graph: MISS_CAUSES_A_WALK, WALK_COMPLETED_4K, and WALK_COMPLETED counts per page accessed vs. load stride.)

For most strides, the MISS_CAUSES_A_WALK event occurs 5 times per page accessed, and the WALK_COMPLETED_4K and WALK_COMPLETED events each occur 4 times per page accessed. This means that all of the completed page walks are for 4K pages. However, there is a fifth page walk that does not complete. Why are there so many page walks per page, and what is causing them? Perhaps when a page walk triggers a page fault, the instruction is re-executed after the fault is handled, which results in another page walk for the same page; that could be counted as two completed page walks. But how come there are 4 completed page walks and one apparently canceled walk? Note that there is a single page walker on Haswell (compared to two on Broadwell).

I realize there is a TLB prefetcher, which appears to be capable only of prefetching the next page, as discussed in this thread. According to that thread, the prefetcher walks do not appear to be counted as MISS_CAUSES_A_WALK or WALK_COMPLETED_4K events, a conclusion I agree with.

There seem to be two reasons for these high event counts: (1) a page fault causes the load to be re-executed, which triggers a second page walk for the same page, and (2) multiple concurrent accesses miss in the TLBs. This is consistent with what happens when both causes are removed: by allocating the memory with MAP_POPULATE (so no page faults occur) and adding an LFENCE instruction after the load instruction, only one MISS_CAUSES_A_WALK event and one WALK_COMPLETED_4K event occur per page. Without LFENCE, the counts are a little larger per page.
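As a sketch, the pre-populated variant can look like the following, assuming the buffer is switched from the bss section to an anonymous mapping so that MAP_POPULATE can be used (the buffer size, stride, and inline-assembly wrapper are illustrative):

#include <sys/mman.h>
#include <stdio.h>

#define BUF_SIZE (64ul << 20)     /* 64 MiB, assumed */
#define STRIDE   64               /* assumed example stride */

int main(void)
{
    /* MAP_POPULATE pre-faults every page, so the loop takes no page faults. */
    char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    char *p = buf;
    unsigned long n = BUF_SIZE / STRIDE - 1;
    __asm__ volatile(
        "1:\n\t"
        "add    %[stride], %[ptr]\n\t"
        "mov    (%[ptr]), %%eax\n\t"
        "lfence\n\t"              /* wait for the load before issuing the next one */
        "dec    %[cnt]\n\t"
        "jg     1b\n\t"
        : [ptr] "+r"(p), [cnt] "+r"(n)
        : [stride] "i"((long)STRIDE)
        : "eax", "memory", "cc");

    munmap(buf, BUF_SIZE);
    return 0;
}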

I tried having each load access an invalid memory location. In this case, the page fault handler raises a SIGSEGV signal, which I handle so that the program continues executing. With the LFENCE instruction, I get two MISS_CAUSES_A_WALK events and two WALK_COMPLETED_4K events per access. Without LFENCE, the counts are a little larger per access.
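How the program is kept running after each invalid access does not matter much for the results, but for concreteness, here is one possible sketch of the handler, assuming it simply skips the faulting load by advancing RIP past it (2 bytes for mov eax, dword [rsi]); a more careful handler would decode the actual instruction length:

#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>
#include <string.h>
#include <stdlib.h>

#define FAULTING_LOAD_LEN 2   /* length of `mov eax, dword [rsi]` (8B 06), assumed */

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info;
    ucontext_t *uc = ctx;
    /* Skip the load that faulted and resume the loop. */
    uc->uc_mcontext.gregs[REG_RIP] += FAULTING_LOAD_LEN;
}

static void install_segv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        abort();
}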

I've also tried using a prefetch instruction instead of a demand load in the loop. The results for the page fault case are the same as for the invalid memory location case (which makes sense because the prefetch fails in both cases): one MISS_CAUSES_A_WALK event and one WALK_COMPLETED_4K event per prefetch. Otherwise, if the prefetch is to a location with a valid in-memory translation, one MISS_CAUSES_A_WALK event and one WALK_COMPLETED_4K event occur per page. Without LFENCE, the counts are a little larger per page.
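A sketch of the prefetch variant, with prefetcht0 standing in for the demand load (the particular prefetch hint and the LFENCE placement are assumptions; __builtin_prefetch would emit a similar instruction):

/* Sketch: the strided loop with a software prefetch instead of a demand load. */
void prefetch_loop(char *p, unsigned long n, long stride)
{
    __asm__ volatile(
        "1:\n\t"
        "add         %[stride], %[ptr]\n\t"
        "prefetcht0  (%[ptr])\n\t"
        "lfence\n\t"
        "dec         %[cnt]\n\t"
        "jg          1b\n\t"
        : [ptr] "+r"(p), [cnt] "+r"(n)
        : [stride] "r"(stride)
        : "memory", "cc");
}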

All experiments were run on the same core. The number of TLB shootdown interrupts that occurred on that core is nearly zero, so they have no impact on the results. I could not find an easy way to measure the number of TLB invalidations performed on that core by the OS, but I don't think this is a relevant factor.
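One way to check the shootdown counts is the TLB line of /proc/interrupts, which holds a per-CPU count of TLB shootdown IPIs; a minimal sketch that prints the whole line (extracting a single core's column is left out):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "TLB"))     /* "TLB: ... TLB shootdowns" row */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}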


The Spike

As the graph above also shows, there is a special pattern for small strides. In addition, there is a very weird pattern (a spike) around stride 220. I was able to reproduce both patterns many times. The following graph zooms in on the spike so you can see it clearly. I think the reason for this pattern is OS activity rather than the way the performance events work or some microarchitectural effect, but I'm not sure.

(Graph: zoomed-in view of the spike around stride 220.)


The Impact of Loop Unrolling

@BeeOnRope suggested placing LFENCE in the loop and unrolling it zero or more times to better understand the effect of speculative, out-of-order execution on the event counts. The following graphs show the results, and a sketch of an unrolled iteration follows them. Each line represents a specific load stride with the loop unrolled 0-63 times (1-64 add/load instruction pairs per iteration). The y-axis shows the event count normalized per page accessed; the number of pages accessed is the same as the number of minor page faults.

(Graphs: MISS_CAUSES_A_WALK and WALK_COMPLETED_4K per page accessed vs. unrolling degree, one line per load stride.)
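For concreteness, here is a sketch of one iteration at an unrolling degree of 1 (two add/load pairs) with a single LFENCE per iteration; the exact placement of LFENCE relative to the loop-control instructions is an assumption:

/* Sketch: unrolling degree 1, i.e., two add/load pairs per iteration,
 * with one LFENCE at the end of the iteration body. */
void unrolled_loop(char *p, unsigned long iters, long stride)
{
    __asm__ volatile(
        "1:\n\t"
        "add     %[stride], %[ptr]\n\t"
        "mov     (%[ptr]), %%eax\n\t"
        "add     %[stride], %[ptr]\n\t"
        "mov     (%[ptr]), %%eax\n\t"
        "lfence\n\t"
        "dec     %[cnt]\n\t"
        "jg      1b\n\t"
        : [ptr] "+r"(p), [cnt] "+r"(iters)
        : [stride] "r"(stride)
        : "eax", "memory", "cc");
}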

I've also run the experiments without LFENCE but with different unrolling degrees. I've not made the graphs for these, but I'll discuss the major differences below.

We can conclude the following:

  • When the load stride is less than about 128 bytes, MISS_CAUSES_A_WALK and WALK_COMPLETED_4K exhibit higher variation across different unrolling degrees. Larger strides have smooth curves where MISS_CAUSES_A_WALK converges to 3 or 5 and WALK_COMPLETED_4K converges to 3 or 4.
  • LFENCE only seems to make a difference when the unrolling degree is exactly zero (i.e., there is one load per iteration). Without LFENCE, the results (as discussed above) are 5 MISS_CAUSES_A_WALK and 4 WALK_COMPLETED_4K events per page; with LFENCE, both become 3 per page. For larger unrolling degrees, the event counts increase gradually on average. When the unrolling degree is at least 1 (i.e., there are at least two loads per iteration), LFENCE makes essentially no difference. This means that the two graphs above are the same for the case without LFENCE, except when there is one load per iteration. By the way, the weird spike only occurs when the unrolling degree is zero and there is no LFENCE.
  • In general, unrolling the loop reduces the number of triggered and completed walks, especially at small unrolling degrees, no matter what the load stride is. Without unrolling, LFENCE can be used to get essentially the same effect; with unrolling, there is no need for LFENCE. Either way, execution time with LFENCE is much higher, so using it to reduce page walks would significantly hurt performance rather than improve it.
Hadi Brais
  • By "page faults" you mean that you didn't have any warmup phase? what happens if you add one, but make sure that the data-set is big enough to thrash all TLB levels? – Leeor Oct 01 '18 at 12:59
  • @Leeor The buffer size is big enough to require thousands of TLB entries for all strides except stride 0, which requires only one TLB entry since only a single page is accessed in that case. But in any case, there is never any TLB thrashing, since I perform all accesses to one page before any accesses to the next page. So there should really be a single TLB miss per page. Also, different experiments should not affect each other, because the OS reclaims all physical pages and invalidates the TLB entries when a process terminates. – Hadi Brais Oct 01 '18 at 19:58
  • The weird patterns at small strides and stride 220 suggests that there is something I'm missing though. – Hadi Brais Oct 01 '18 at 19:59
  • What if you measure separately user and kernel counts? The kernel could be responsible for some of the misses. – BeeOnRope Oct 13 '18 at 23:49
  • Keep in mind that if you have Meltdown mitigation in your kernel, you'll probably effectively clear the TLB when you transition to kernel mode, so your results can be explained by the kernel simply taking, say, 3 or 4 TLB misses, and user mode taking 1 or 2 (depending on how the "restart" after the page fault is handled). This would repeat on each call into the kernel, since the Meltdown mitigations prevent the translations from being cached across the transition. – BeeOnRope Oct 14 '18 at 03:32
  • @BeeOnRope The results I've shown are only for user-mode counts. So the 5 `MISS_CAUSES_A_WALK` events per iteration and the 4 `DTLB_LOAD_MISSES.WALK_COMPLETED_4K` per iteration are presumably occurring in user mode. This includes the transition from user to kernel, but probably not for kernel to user. But good point about the meltdown mitigation. I'll try to disable it and see if the results change. – Hadi Brais Oct 14 '18 at 05:40
  • Yeah it's hard to see where the 4/5 events come from then. As you suggest maybe it's some speculative behavior where multiple accesses are being considered in parallel, and then get "cancelled" by the page fault and start over after. What happens if you make the loads dependent? What happens if you put an lfence in the loop? – BeeOnRope Oct 14 '18 at 16:25
  • @BeeOnRope When I use lfence, both `MISS_CAUSES_A_WALK` and `WALK_COMPLETED_4K` become 3 per page. Also the weird spike for strides 128-256 is gone. Note that I said in my earlier comment "by iteration" by mistake. The numbers are per page. – Hadi Brais Oct 14 '18 at 18:03
  • Maybe try turning off hardware prefetching to see if that's interacting here. The PF behavior is often stride-dependent. – BeeOnRope Oct 14 '18 at 18:19
  • @BeeOnRope Hardware prefetching has no impact on the page walk events. – Hadi Brais Oct 14 '18 at 18:28
  • @BeeOnRope I've rerun the experiments with PTI disabled by adding `nopti nospectre_v2 nospec` to the grub kernel command line. I got the same results. – Hadi Brais Oct 15 '18 at 05:54
  • @HadiBrais - so it seems the best we can say now is that speculative/OoO execution was responsible for 2 `MISS_CAUSES_A_WALK` and 1 `WALK_COMPLETED_4K` events. That in itself is interesting: you'd expect maybe 1/0 or 0/0: the walks from speculative execution should never "complete" because they will be cancelled by the page-fault of the earlier access, and with only one hardware page walker it seems like maybe they wouldn't even start - but it is reasonable that perhaps the `MISS_CAUSES_A_WALK` event would increment once the miss is detected, even if the walk is waiting in ... – BeeOnRope Oct 17 '18 at 01:34
  • ... line for the page walker hardware. However, 2/1 is pretty weird. Perhaps because the fault is taken at retirement, there is time for additional walks to start and complete. You could test by unrolling the loop by 2 and putting an `lfence` only between every 2 loads rather than after each load, to see if the 2/1 is coming from more than one additional load beyond the oldest. After all that, we are still left with the 3/3 events even in the case with `lfence`. It is weird; you'd expect at most 2/2: one walk before and one walk after the fault. – BeeOnRope Oct 17 '18 at 01:36
  • @BeeOnRope Good idea! Preliminary results for up to 16 loads before the lfence show that the counts for both `WALK_COMPLETED_4K` and `MISS_CAUSES_A_WALK` gradually increase up to 3.6 and 4.4, respectively. I'll try to generalize my code so that I can do this for an arbitrary number of loads before the lfence and post the results in graph form. – Hadi Brais Oct 17 '18 at 02:55
  • My theory with the "spike" is that something special happens for strides that "just" make it into the next page during speculative execution before the page fault "hits". For example, if the CPU tends to be able to run ahead ~18 additional loads before the fault stops user execution, then you'd just about get into the next page. Maybe this causes another walk event that you don't see at smaller strides because the second miss doesn't happen. Then for longer strides perhaps there is enough time to actually cache the walk result somehow so it doesn't need to happen again (kind of unlikely!). – BeeOnRope Oct 17 '18 at 03:56
  • About the pattern for small strides: I can't really tell because the line in the graph is thick and kind of obscures the pattern, but maybe it has something to do with split-page loads? You are loading a `DWORD` so for small strides like 1, 3 or 5 split loads will be common. Perhaps such loads increment the counters twice as much as would otherwise be expected? – BeeOnRope Oct 17 '18 at 03:59
  • @BeeOnRope Updated with the results. – Hadi Brais Oct 17 '18 at 07:27
  • @BeeOnRope I think my earlier observation that lfence only makes a difference when there is one load per iteration is not accurate. I'll make the graphs for the case where there is no lfence and add them. – Hadi Brais Oct 18 '18 at 03:14
  • About the theory "multiple concurrent accesses to the same translation that doesn't exist in the STLB": if that was a big source of extra events (e.g., the reason it goes from 2 to 4), wouldn't the number of accesses drop as you approach stride >= 4096? At stride 4096 there is only 1 access per page, so there should not be concurrent accesses to the same translation. – BeeOnRope Aug 01 '19 at 00:59
  • @BeeOnRope Good point, Bee. Even for stride 4096, without lfence and with or without `MAP_POPULATE`, the events `CAUSES_A_WALK`, `WALK_COMPLETED_4K`, and `WALK_COMPLETED` all overcount. I think "same translation" should be removed. – Hadi Brais Aug 01 '19 at 10:32
  • I am a beginner in computer system performance; I am a master's student. Can anyone guide me on how to count the above events, for example `dtlb_load_misses.walk_active`, via the `perf stat -e` command? I get an `event syntax error`. Also, `perf list` does not list this exact event; all it has is `dTLB-load-misses` and other events. – Abhishek Ghosh Nov 03 '22 at 21:27
