
For my measurements there are two machines: one is the client node (Haswell), the other is the server node (Skylake), and both nodes have a Mellanox ConnectX-5 NIC. The client sends packets to the server at a high rate, and a simple application, L2 forwarding, runs on the server node with 4096 RX descriptors. I have sent packets of several sizes (64B, 128B, 256B, 512B, 1024B, 1500B); however, I get an interesting result: when I send 128B packets, the latency (both LAT99 and LAT-AVG) is much better than for the other packet sizes.
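For reference, a roughly equivalent single-core MAC-forwarding setup can be expressed with DPDK's testpmd (a sketch only: the PCIe address and core list are placeholders, and my actual setup is built with fastclick, mentioned in the comments below):

# Sketch: single-core L2 (MAC) forwarding with 4096 RX/TX descriptors on one RX queue.
sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:18:00.1 -- \
    --forward-mode=mac --nb-cores=1 --rxq=1 --txq=1 \
    --rxd=4096 --txd=4096 --burst=64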

My measurement results are below:

Packet size   Throughput (bit/s)   PPS             LAT99     LAT-AVG
64B           14772199568.1        20983238.0228    372.75    333.28
128B          22698652659.5        18666655.1476     51.25     32.92
256B          27318589720          12195798.9821    494.75    471.065822332
512B          49867099486          11629454.1712    491.5     455.98037273
1024B         52259987845.5         6233300.07701   894.75    842.567256665
1500B         51650191179.9         4236400.1952   1298.5    1231.18194373
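A quick sanity check suggests the THROUGHPUT column is line-rate bits per second: each value equals PPS × (packet size + 24 B for FCS, preamble and inter-frame gap) × 8. For example (assuming a standard awk):

# 128B row: 18666655.1476 pps * (128 + 24) B * 8 bit/B ≈ 22698652659 bit/s, matching the table.
echo "18666655.1476 128" | awk '{printf "%.1f bit/s\n", $1 * ($2 + 24) * 8}'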

Some settings and configuration: `sudo mlxconfig -d 0000:18:00.1 q` (output attached as screenshots).
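For readability, the firmware settings that come up in the discussion can also be pulled out of the query as plain text (the field names are the standard mlxconfig ones):

# Show only the firmware settings referenced elsewhere in this question/answer.
sudo mlxconfig -d 0000:18:00.1 q | grep -E 'CQE_COMPRESSION|PCI_WR_ORDERING|ZERO_TOUCH_TUNING_ENABLE|SRIOV_EN'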

The server node (Skylake) has DDIO enabled, so packets are delivered directly into the L3 cache. The latency gap between 333.28 and 32.92 is similar to the gap between L1 cache and L3 cache access latency, so I guess it might be due to L1 prefetching: the L1 cache prefetches better when the node receives 128B packets than with other packet sizes.

My questions:
  1. Is my guess correct?
  2. Why is it faster to process 128B packets? Is there a specific L1 prefetch strategy that can explain this result?
  3. If my guess is wrong, what is causing this phenomenon?
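One way to test the cache-level guess directly would be the hardware cache counters on the forwarding core while traffic is running (a sketch, assuming perf is available, core 0 runs the forwarding loop, and the generic cache events are supported on this CPU):

# Compare L1D and LLC load/miss behaviour on core 0 for each packet size.
sudo perf stat -C 0 \
    -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -- sleep 10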

  • With my extensive testing of Mellanox NICs on both AMD and Intel platforms, I can easily confirm DDIO has nothing to do with the 128B performance. Please update your question with the PMD arguments passed, the number of RX queues, and the core pinning to help you more. – Vipin Varghese May 23 '22 at 09:44
  • Thanks for your answer. I have pinned the L2 forwarding to core 0 (only one core, only one RX queue). – xuxing chen May 23 '22 at 10:19
  • Are you using PMD args? On AMD Milan we get 40Mpps (with no PMD args) and on Intel Icelake we get 39.8Mpps (with no PMD args) for 64B, 128B, etc. It looks like pause or relaxed ordering is not enabled, hence HW drops in the NIC. – Vipin Varghese May 23 '22 at 10:57
  • The maximum that a single RX queue achieves on MLX-5 and MLX-6 is about 40Mpps, and with another vendor I am able to get 68Mpps with 1 RX queue. So it definitely looks like a configuration issue; post the ethtool statistics for better debug (see the sketch after these comments). For HW debug use the `Neo Host` tool. – Vipin Varghese May 23 '22 at 16:18
  • All my testing was focused on 100Gbps and 200Gbps with Mellanox CX-5 and CX-6. – Vipin Varghese May 23 '22 at 16:22
  • I am sorry for my mistake. In my test I get 18Mpps (18666655.1476), not 18Gpps, for the 128B packets. However, my goal is not to increase my throughput or PPS. My purpose is to find out whether it is really faster to process 128B packets than other packets (the gap is shown in my table above) and why. So, did you find the same phenomenon (better latency for 128B packets) in your tests at 40Mpps? – xuxing chen May 24 '22 at 02:15
  • My application running on the server node is L2 forwarding, a simple I/O-intensive application. – xuxing chen May 24 '22 at 02:39
  • `My purpose is to find if it is really faster to process 128B packets than other packets`. You have not shared any BIOS or kernel settings. So please share the BIOS, kernel, NIC firmware, ethtool, and PCI settings for the environment with `mlxconfig -d [pcie address] q`. Based on the table shared, 64B has 20Mpps, 128B has 18Mpps, and 256B has 12Mpps, so I do not see anything wrong except a configuration issue on your platform. I do not know how you are calculating the latency; is this for a zero-packet-drop scenario with IXIA or Spirent? – Vipin Varghese May 24 '22 at 03:03
  • I really appreciate your answer, and I have shared some settings following your guide (`mlxconfig -d [pcie address] q`). For calculating the latency, I use fastclick (https://github.com/tbarbette/fastclick) to generate and transmit the packets and to calculate the latency and throughput; some computational details are hidden by this framework. – xuxing chen May 24 '22 at 04:26
  • I have updated the answer with respect to the MLX-5 foundational NIC. I hope this information and the results help you. If they do, please accept and upvote to close the question and help others. – Vipin Varghese May 26 '22 at 14:37
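For reference, the kind of ethtool statistics asked for above can be collected with something like the following (the interface name is a placeholder and the counter names vary by driver):

# Dump the NIC counters and keep only drop/discard/pause related ones.
ethtool -S enp24s0f1 | grep -iE 'drop|discard|pause|out_of_buffer'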

1 Answer


@xuxingchen there are multiple questions, and clarifications are required to address them, so let me clarify step by step:

  1. The current setup is listed as Mellanox ConnectX-5, but mlxconfig states it is a DPU. A DPU has an internal engine, and its latency will be different from a foundational NIC from Mellanox such as MLX-4, MLX-5, or ConnectX-6.
  2. The PCIe read size is recommended to be updated to 1024 (a host-side check is sketched after this list).
  3. It is mentioned as Skylake, which has PCIe Gen 3.0, but mlxconfig reports PCIe Gen 4.0 as the connection.
  4. CQE compression is set to balanced, but the recommended setting (even for vector mode) is aggressive.
  5. For DDIO to work, the PCIe device (firmware) needs TPH (TLP Processing Hints) activated to allow the steering tag to be populated from user space to the NIC firmware. For Intel NICs there is code in the DPDK PMD to achieve this.
  6. In the case of Mellanox, I do not find the TPH-enabling code in the PMD. Hence I have to speculate that if the DPU NIC supports DDIO, it might be through driver tag steering via MSI-X interrupts pinned to a CPU core. For this, one needs to disable the IRQ affinity balancing for the current NIC and pin all of its interrupts to specific cores (other than the DPDK cores).
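Items 2, 3 and 5 can be checked from the host with lspci (a sketch; the PCIe address is the one from the question, and the field/capability names are the ones printed by recent lspci versions):

# Current MaxReadReq / MaxPayload and negotiated link speed/width vs. the card's capability.
sudo lspci -vv -s 0000:18:00.1 | grep -E 'MaxReadReq|MaxPayload|LnkCap:|LnkSta:'
# Whether the device exposes the TPH (TLP Processing Hints) capability at all.
sudo lspci -vv -s 0000:18:00.1 | grep -i 'Transaction Processing Hints'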

With these, my recommendations for the right settings (only for the foundational NICs CX-5 and CX-6, not the DPU, since I have not tested it) are:

# Stop IRQ balancing so the manual IRQ pinning below sticks, and stop unrelated services.
systemctl stop irqbalance.service
systemctl disable irqbalance.service
systemctl stop wpa_supplicant
systemctl disable wpa_supplicant
# Pin the NIC interrupts to cores that are not used by DPDK.
./set_irq_affinity_cpulist.sh [non dpdk cores] [desired NIC]
# Disable SR-IOV and apply the Mellanox tuning profile.
mlxconfig -d [pcie device id] set SRIOV_EN=0
mlnx_tune -r
# Enlarge the TX queue length and the RX/TX rings, and disable pause frames.
ifconfig [NIC] txqueuelen 20000
ethtool -G [NIC] rx 8192 tx 8192
ethtool -A [NIC] rx off tx off
# Firmware settings: zero-touch tuning, aggressive CQE compression, relaxed PCI write ordering.
mlxconfig -d [pcie address] set ZERO_TOUCH_TUNING_ENABLE=1
mlxconfig -d [pcie address] set CQE_COMPRESSION=1
mlxconfig -d [pcie address] set PCI_WR_ORDERING=1

With the above settings, plus the settings from the performance report for the MLX-5 foundational NIC, I am able to achieve the following result on AMD EPYC:

Performance with vector mode with MLX-5
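For reference, the testpmd invocation style used for such MLX-5 performance runs looks roughly like the following (a sketch; the PCIe address, core list and devargs are examples, and the authoritative values are in the performance report matching your DPDK release):

# Example run: one forwarding core, large rings, bursts of 64,
# Multi-Packet RQ and RX padding enabled via mlx5 devargs.
sudo dpdk-testpmd -l 0-1 -n 4 \
    -a 0000:18:00.1,mprq_en=1,rxq_pkt_pad_en=1 \
    -- --forward-mode=io --nb-cores=1 --rxq=1 --txq=1 \
    --rxd=4096 --txd=4096 --burst=64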

[EDIT-1] Based on the comments, there is an incorrect assumption that the CPU is the bottleneck for the lower packets-per-second per queue. To prove it is not a CPU or platform issue, the same test is run with multiple Mellanox ports on 1 CPU (that is, 1 RX queue per port across 2 ports):

[screenshot of the multi-port, single-core result]

Note: with other vendors' NICs (Intel and Broadcom) one can easily achieve 68Mpps and 55Mpps respectively with 1 port and 1 RX queue.

Using multiple RX queues on the same CPU core, we can achieve higher Mpps (showing that the individual RX queue is the limiting factor on MLX).
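A multi-queue, single-core run can be expressed with testpmd roughly as follows (a sketch; the PCIe address and queue count are placeholders):

# 4 RX/TX queues all serviced by a single forwarding core.
sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:18:00.1 -- \
    --forward-mode=mac --nb-cores=1 --rxq=4 --txq=4 --rxd=4096 --txd=4096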

Vipin Varghese
  • Please note the numbers collected are for 1 RX queue with 1 CPU thread. – Vipin Varghese Jun 06 '22 at 03:47
  • Why can't small packets reach a packet rate of 100? E.g., 128B packets can only reach 38, but 1024B packets can reach 100. – xuxing chen Jun 21 '22 at 02:37
  • @xuxingchen at least with the investigation using the Mellanox NEO-Host tool, it looks like there is a HW limitation in the NIC's embedded switch which prevents it from pushing more than 35~38Mpps per RX queue on the Mellanox CX-5 (100Gbps), CX-6 (100Gbps), and CX-6 (200Gbps). – Vipin Varghese Jun 21 '22 at 03:18
  • I suspect that the bottleneck is on the CPU side rather than the NIC side: the CPU cannot handle such fast packet traffic, resulting in a packet rate of less than 100. – xuxing chen Jun 21 '22 at 08:11
  • @xuxingchen your suspicion is incorrect; as mentioned, based on the analysis with the NEO-Host tool for the Mellanox NIC, it is clear the problem is on the Mellanox CX-5 and CX-6 NICs. The same processor and PCIe slot with Intel and Broadcom NICs can do 68Mpps and 52Mpps per RX queue. – Vipin Varghese Jun 21 '22 at 08:51
  • I used the `mlnx_tune` command to check the status of the Mellanox BlueField device. It warns that "PCI width status is below PCI capabilities", which matches your analysis. However, it recommended that I check the PCI configuration in the BIOS. How can I configure the BIOS to solve this? – xuxing chen Apr 05 '23 at 07:12