Recently I am trying to do some forwarding test with DPDK "testpmd" application. And I find something interesting.
When 512 descriptors are used for TX and RX, the performance is better than using 4096 descriptors. After checking the counters with perf
command, I find that a huge number of "dTLB-load-misses" is observed. And it is about more than 100 times of that with 512 descriptors. But the page-faults are always zero. With the ":u" and ":k" arguments, it seems that most of the TLB misses are in the user space. All the buffers are in one huge page for storing the data of network payloads, and the hugepage is 512MB size. Each buffer is less than 3KB. The buffer and the descriptors are one-to-one map.
So is there any clue that I can find the huge number of TLB misses? And will it have some effect to the performance (degradation)?