
I am seeing very high L3 cache misses on AMD while running a DPDK-based forwarding/routing application. The application consists of a packet-poll thread (P1) and two worker threads (W1 and W2). P1 polls the NIC and sprays packets to W1 or W2; each worker does a fixed amount of per-packet work and hands the packet back to P1 for transmit. On an AMD EPYC 7702 I cannot get past 22 Mpps, and on an AMD EPYC 7542 it is only 15 Mpps. Compare this to an Intel Xeon 6248R, where the same application reaches ~40 Mpps. The NIC in all cases is a Mellanox ConnectX-5 dual-port 100 Gbps.
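For context, the pipeline roughly corresponds to the sketch below. This is a minimal illustration, not the actual application: the `rte_ring` handoff, the port/queue IDs, and the round-robin spray policy are all assumptions.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST 32

/* Illustrative rings connecting P1 to the workers; in a real app they
 * would be created with rte_ring_create(). */
struct rte_ring *to_w1, *to_w2, *from_workers;

/* P1: poll the NIC, spray packets across W1/W2, then drain finished
 * packets from the workers and transmit them. */
static int poll_thread(void *arg)
{
    struct rte_mbuf *rx[BURST], *tx[BURST];
    unsigned int spray = 0;
    (void)arg;
    for (;;) {
        uint16_t n = rte_eth_rx_burst(0 /* port */, 0 /* queue */, rx, BURST);
        for (uint16_t i = 0; i < n; i++) {
            /* Round-robin spray; the real policy may differ. */
            struct rte_ring *r = (spray++ & 1) ? to_w2 : to_w1;
            if (rte_ring_enqueue(r, rx[i]) != 0)
                rte_pktmbuf_free(rx[i]);        /* worker ring full: drop */
        }
        unsigned int m = rte_ring_dequeue_burst(from_workers,
                                                (void **)tx, BURST, NULL);
        uint16_t sent = rte_eth_tx_burst(0, 0, tx, (uint16_t)m);
        for (uint16_t i = sent; i < m; i++)
            rte_pktmbuf_free(tx[i]);            /* TX queue full: drop */
    }
    return 0;
}

/* W1/W2: dequeue, do the fixed per-packet job, hand back to P1. */
static int worker_thread(void *arg)
{
    struct rte_ring *in = arg;   /* to_w1 or to_w2 */
    void *m;
    for (;;) {
        if (rte_ring_dequeue(in, &m) == 0) {
            /* ... fixed packet job goes here ... */
            if (rte_ring_enqueue(from_workers, m) != 0)
                rte_pktmbuf_free(m);
        }
    }
    return 0;
}
```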

I also see high L3 cache misses even if I simply drop packets at Rx: the poll thread (P1) receives a burst and immediately frees all of the packets, with no handoff to the workers, so everything happens in a single thread. Even then I observe very high L3 cache misses. I have also tried testpmd but cannot get beyond 35 Mpps. The numbers in this report look quite overwhelming in comparison (although the hardware is different):

https://fast.dpdk.org/doc/perf/DPDK_19_08_Mellanox_NIC_AMD_performance_report.pdf
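For clarity, the rx-and-drop test described above boils down to roughly this loop (again a sketch; the port/queue IDs are assumptions):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Rx-and-drop: receive a burst and free every mbuf immediately,
 * all within the poll thread. Port/queue IDs are assumptions. */
static void rx_drop_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];
    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(pkts[i]);
    }
}
```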

  • You know your [Epyc 7702](https://en.wikichip.org/wiki/amd/epyc/7702) (Zen 2)'s L3 cache is made up of 16x 16 MiB separate caches, right? (So each CCX of 4 cores only shares a 16 MiB L3, and has worse bandwidth to cores in other CCXs.) See https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core_Complex. An Intel CPU would have a single L3 shared between all cores, which might be better for your application, especially if it isn't optimized for locality by scheduling threads within the same CCX. – Peter Cordes Jun 05 '22 at 02:37
  • @Ravi, I have been using both Rome and Milan with CX-5; I highly recommend checking https://stackoverflow.com/questions/72345569/why-does-dpdk-mellanox-connectx5-process-128b-packets-much-faster-than-other-s/72393527#72393527. Please set up a call for further clarification. Tip: just replace the MLX5 NIC with an Intel or Broadcom NIC and check the difference. – Vipin Varghese Jun 06 '22 at 03:39
  • @PeterCordes I know AMD has the CCX architecture, and I even tried pinning the application to cores on the same CCX. I played with a few BIOS settings and noticed that if I disable memory interleaving, my single-thread performance jumps, but my multi-thread performance stays the same. All threads of my application are part of the same CCD. My point here is based on the link I pasted in the OP, which provides numbers for AMD Zen 2. – Ravi Jun 06 '22 at 20:35
  • @Ravi, did you check https://stackoverflow.com/questions/72345569/why-does-dpdk-mellanox-connectx5-process-128b-packets-much-faster-than-other-s/72393527#72393527? As requested, can you also update the question with the sample code for the test? – Vipin Varghese Jul 11 '22 at 11:29
  • @Ravi, are there any updates from your end? Can you share the details of your test? With a 100 Gbps port at 64B I am able to achieve 148 Mpps with the MLX5 PMD. – Vipin Varghese Jul 12 '22 at 16:22
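
Following up on the CCX discussion in the comments: on Zen 2, each CCX is four cores sharing a 16 MiB L3 slice, so one way to keep the whole pipeline inside a single L3 is to hand EAL a core list confined to one CCX. A minimal sketch, where the core IDs 0-3 are an assumption that must be checked against the actual topology (e.g. with lscpu or hwloc):

```c
#include <rte_eal.h>

int main(int argc, char **argv)
{
    (void)argc;
    /* "-l 0-3" confines all EAL threads to cores 0-3. Whether those
     * four cores actually share one CCX must be verified against the
     * real topology (lscpu, hwloc); the IDs here are an assumption. */
    char *eal_args[] = { argv[0], "-l", "0-3" };
    if (rte_eal_init(3, eal_args) < 0)
        return -1;
    /* launch P1 and W1/W2 on the pinned lcores here */
    return rte_eal_cleanup();
}
```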

0 Answers