
Recently I have been doing some forwarding tests with the DPDK "testpmd" application, and I found something interesting.

When 512 descriptors are used for TX and RX, the performance is better than with 4096 descriptors. After checking the counters with the perf command, I found a huge number of "dTLB-load-misses" -- more than 100 times as many as with 512 descriptors. But the page-faults counter is always zero. Using the ":u" and ":k" modifiers, it seems that most of the TLB misses happen in user space. All the buffers for storing the network payload data are in one huge page, and the hugepage is 512MB in size. Each buffer is less than 3KB, and the buffers and descriptors are mapped one-to-one.

So is there any clue as to where this huge number of TLB misses comes from? And can it cause a performance degradation?

  • Hi Andriy, it is an ARMv8.1-compatible multi-core CPU. The core is not from ARM; only the architecture is licensed. – Polymersudo Aug 30 '18 at 01:45
  • Another funny thing is that when the buffers are only used for RX, the dTLB miss number is not so huge. But when the application **forwards** the buffers (just puts them into the NIC TX queue without modifying the data field), the dTLB miss number increases significantly. – Polymersudo Aug 30 '18 at 02:03
  • Well, more is not always better, as you can see. Descriptors are usually used as a circular buffer. So even if the app only forwards, more descriptors might not fit into the L1/L2 dTLB cache... – Andriy Berestovskyy Aug 30 '18 at 05:42
  • I didn't describe it quite clearly above. I mean, if testpmd is set to **rxonly** mode with 4096 txd and 4096 rxd, for example, the dTLB load miss number is not very large compared to that with 512 rxd/txd. But when testpmd is set to the **io** _fwd_ mode with 4096 txd/rxd, the miss number increases more than 100 times. – Polymersudo Aug 30 '18 at 05:51
  • Still, my understanding of the issue is the same: rxonly mode uses 4096 rxds, which might fit into the CPU dTLB cache, while io mode uses twice as many descriptors: 4096 rxds and 4096 txds. As long as the app's memory footprint fits into the cache, there are basically no misses, i.e. ~0. Once the app reads/writes more than fits into the cache, suddenly there will be lots of load and store misses. Whether it is 100 or 1000 times more does not matter, since we are comparing misses against ~0... – Andriy Berestovskyy Aug 30 '18 at 06:37
  • Ah, I think I understand a little more right now. **Many thanks**, then I will check the manual about the TLB mechanism on our CPU. – Polymersudo Aug 30 '18 at 07:17

1 Answer


In general, CPU TLB cache capacity depends on the page size. This means that for 4KB pages and for 512MB pages there might be a different number of L1/L2 TLB cache entries.

For example, for ARM Cortex-A75:

The data micro TLB is a 48-entry fully associative TLB that is used by load and store operations. The cache entries have 4KB, 16KB, 64KB, and 1MB granularity of VA to PA mappings only.

Source: ARM Info Center

For ARM Cortex-A55:

The Cortex-A55 L1 data TLB supports 4KB pages only. Any other page sizes are fractured after the L2 TLB and the appropriate page size sent to the L1 TLB.

Source: ARM Info Center

Basically, this means that the 512MB huge page mapping will be fractured into smaller pieces (down to 4KB), and only those small pieces will be cached in the L1 dTLB.

So even if your application fits into a single 512MB page, the performance will still depend greatly on the actual memory footprint.

Andriy Berestovskyy
  • Now I am a little clearer about this, and it seems that the TLB is a bit different from that on MIPS-based CPUs. I need to check the datasheet and verify the behavior of the dTLB. It is quite possible that the TLB behavior is compliant. Thanks a lot – Polymersudo Aug 30 '18 at 01:48
  • Yeah, ARMs are quite different, so the datasheet is the best place to confirm. You should also check which hardware register the "dTLB-load-misses" event is mapped to, and then check in the datasheet what exactly that register counts... – Andriy Berestovskyy Aug 30 '18 at 05:50
  • Thanks, I will ask our kernel team for the manual. Currently, I only get several simple slides about this... – Polymersudo Aug 30 '18 at 05:57