
I am trying to debug an issue related to packet loss when using DPDK. When the application runs without DPDK, the issue is not seen.

To explain: I have a process A which receives packets from process B (running on a different server).

Initial issue: When DPDK is enabled in process A, the packet flow is fine for the first few seconds; however, after a few minutes process A stops receiving any packets. What could be the reason for this? I have confirmed that packets are still being sent by process B.

To debug this, I enabled the pdump feature in my application so that I can take a packet capture using dpdk-pdump. While debugging, I see that the server is receiving packets when I check using dpdk-proc-info:

[root@QVr740-6 app]# ./dpdk-proc-info   -- --stats -p 0x1
EAL: Cannot find resource for device
EAL: No legacy callbacks, legacy socket not created

  ######################## NIC statistics for port 0  ########################
  RX-packets: 11595973    RX-errors:  0           RX-bytes:  17231595358
  RX-nombuf:  0
  TX-packets: 0           TX-errors:  0           TX-bytes:  22

  ############################################################################

However, when I try to take a packet capture:

[root@QVr740-6 app]# ./dpdk-pdump -l 42,44,46  --   --pdump 'device_id=0000:18:00.1,queue=*,rx-dev=/home/cu1/nmurshed/capture.pcap'
EAL: Detected 56 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_69588_2a3baabe32a56
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:18:00.1 (socket 0)
EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:18:00.2 (socket 0)
EAL: Cannot find resource for device
EAL: No legacy callbacks, legacy socket not created
Port 2 MAC: 02 70 63 61 70 01
 core (42), capture for (1) tuples
 - port 0 device (0000:18:00.1) queue 65535
^C

Signal 2 received, preparing to exit...
##### PDUMP DEBUG STATS #####
 -packets dequeued:                     0
 -packets transmitted to vdev:          0
 -packets freed:                        0

How can I find out where these packets are being dropped? I did confirm that dpdk-pdump works when the issue is not occurring.

Any hints will be valuable, as I have been tearing my hair out over this.

EDIT:

I missed something in the stats. I see that rx_missed_errors keeps increasing at an alarming rate when the issue occurs.

Wed Oct 20 18:47:46 PDT 2021
rx_missed_errors: 0
Wed Oct 20 18:47:47 PDT 2021
rx_missed_errors: 0
Wed Oct 20 18:47:48 PDT 2021
rx_missed_errors: 0
Wed Oct 20 18:47:49 PDT 2021
rx_missed_errors: 8216
Wed Oct 20 18:47:50 PDT 2021
rx_missed_errors: 32384
Wed Oct 20 18:47:51 PDT 2021
rx_missed_errors: 56510
Wed Oct 20 18:47:52 PDT 2021
rx_missed_errors: 80636
Wed Oct 20 18:47:53 PDT 2021
rx_missed_errors: 104762
Wed Oct 20 18:47:54 PDT 2021
rx_missed_errors: 128882
Wed Oct 20 18:47:55 PDT 2021
rx_missed_errors: 152960
Wed Oct 20 18:47:56 PDT 2021
rx_missed_errors: 177086
Wed Oct 20 18:47:57 PDT 2021

I increased the number of RX/TX descriptors at queue setup (rte_eth_rx_queue_setup), which delays the problem. Somehow, my application is not freeing the RX descriptors.

Question: does each received packet consume one RX descriptor?
Is it possible that my application takes too long to process each packet, or is it that I am not freeing them?

nmurshed
  • what do you mean by without dpdk the packet flow is fine, who is receiving packets without dpdk? – Effie Oct 19 '21 at 09:06
  • What I mean is.. I have a way to build process A without DPDK.. so dpdk is not in the picture then. – nmurshed Oct 19 '21 at 13:11
  • @numrshed please add the compile flags (static or shared) mode for DPDK, a snippet of DPDK API calls in process A and the arguments used for rte_eal_init. From the current question explained `you are referring to packet drop as process B (pdump) not receiving packets`. is this the right understanding? – Vipin Varghese Oct 20 '21 at 03:01
  • Hi @vipin, I missed the rx_missed_errors counter... Initially the counter is 0..then it starts increasing.. which explains the drop Wed Oct 20 18:47:48 PDT 2021 rx_missed_errors: 0 Wed Oct 20 18:47:49 PDT 2021 rx_missed_errors: 8216 Wed Oct 20 18:47:50 PDT 2021 rx_missed_errors: 32384 Wed Oct 20 18:47:51 PDT 2021 rx_missed_errors: 56510 Wed Oct 20 18:47:52 PDT 2021 rx_missed_errors: 80636 Wed Oct 20 18:47:53 PDT 2021 rx_missed_errors: 104762 – nmurshed Oct 21 '21 at 01:53
  • Yes, correct.. my application itself doesn't receive packets ..which I now believe is due to rx_missed_errors.. Increasing the rx_desc increases the time it takes for the issue to happen.. but I guess need to root cause why the fd's are not enough... any hints on what to look for in my application will be helpful – nmurshed Oct 21 '21 at 01:56
  • @nmurshed this is good observation, I am open to talk with you over skype, zoom, google meet. Let me know if it's useful for a live debug. – Vipin Varghese Oct 21 '21 at 03:15
  • @Vipin we can connect. Unfortunately, I won’t be able to divulge a lot of details about the application as its work related. How can I connect with you. – nmurshed Oct 21 '21 at 12:02
  • You can reach me on skype, zoom or google meet. If you can reproduce with standard application l2fwd, skeleton so we can debug – Vipin Varghese Oct 21 '21 at 13:24
  • Not able to repro. In fact, using the same binaries, it is not reproduced on another server; the only difference is that the sender process is different. One thing I can see from the capture is that in the problematic setup there is a lot of fragmentation: the packet length is 9000 bytes, fragmented to 1440. Not sure if that is playing a role, but the application should handle it. – nmurshed Oct 21 '21 at 13:48
  • Can't find vipinpv85 in google meet.. maybe join https://meet.google.com/imf-fjqo-ozk – nmurshed Oct 21 '21 at 14:32
  • got disconnected : https://meet.google.com/imf-fjqo-ozk – nmurshed Oct 21 '21 at 16:26

1 Answer


The DPDK counter rx_missed_errors means that these packets reached the NIC but were dropped because no free RX descriptor was available, i.e. the application did not drain the RX queue fast enough. RX-nombuf, by contrast, counts packets that could not be DMAed to CPU memory because no MBUF buffer could be allocated. Since RX-nombuf stays at 0 here, the error is most likely in the application logic: either it spends too much time processing the packets, or it processes the same MBUF array repeatedly after rx_burst.
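
To make the "application too slow" case concrete, here is a toy model of a single RX queue in plain C. It is only a sketch, not DPDK code: `rx_model`, `nic_receive`, `app_poll` and `first_drop_tick` are hypothetical names, and it assumes each packet occupies exactly one descriptor (in practice a packet larger than the mbuf data room can span several). It shows that when the application polls fewer packets per tick than arrive, drops eventually start, and a larger ring only postpones them.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one RX queue: the NIC fills descriptors, the application
 * drains them in bursts. None of these names are DPDK APIs. */
struct rx_model {
    uint32_t ring_size;   /* number of RX descriptors */
    uint32_t in_use;      /* descriptors holding packets not yet polled */
    uint64_t received;    /* packets placed into the ring */
    uint64_t missed;      /* packets dropped: no free descriptor */
};

/* NIC side: n packets arrive; each needs one free descriptor
 * (simplifying assumption: one descriptor per packet). */
static void nic_receive(struct rx_model *q, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        if (q->in_use < q->ring_size) {
            q->in_use++;
            q->received++;
        } else {
            q->missed++;   /* analogous to rx_missed_errors growing */
        }
    }
}

/* Application side: poll up to burst packets, handing descriptors back. */
static uint32_t app_poll(struct rx_model *q, uint32_t burst)
{
    uint32_t n = q->in_use < burst ? q->in_use : burst;
    q->in_use -= n;
    return n;
}

/* Run up to ticks steps; return the first tick with a drop, or -1 if none. */
static int first_drop_tick(uint32_t ring_size, uint32_t arrive_per_tick,
                           uint32_t poll_per_tick, int ticks)
{
    struct rx_model q = { .ring_size = ring_size };

    assert(ring_size > 0);
    for (int t = 0; t < ticks; t++) {
        nic_receive(&q, arrive_per_tick);
        if (q.missed > 0)
            return t;
        app_poll(&q, poll_per_tick);
    }
    return -1;
}
```

With 32 packets arriving and only 24 polled per tick, `first_drop_tick(128, 32, 24, 1000)` reports the first drop at tick 13, while a 1024-entry ring moves it to tick 125. When the poll rate matches the arrival rate, no drop occurs at all. This matches the observation that increasing the descriptor count only delays the problem.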

[EDIT-1] Based on a couple of debugging sessions and pointers, the issue was root-caused to application logic. Summary:

  1. For incoming ARP requests, the packet is processed and the ARP reply is sent out on the same MBUF, but rte_pktmbuf_free was called on it immediately after rte_eth_tx_burst. - Issue fixed.
  2. For IP packets, the IP and UDP headers are processed for the desired packet and the necessary changes are made to the MBUF before transmission. Under certain conditions (depending on the packet count) the logic enters long loops, which stalls the return from the function.

Note:

  1. Fixing the above two issues seems to solve the problem.
  2. Using DPDK-Pktgen to generate custom packets helps narrow down the specific code areas.
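
The ARP fix in item 1 comes down to buffer ownership: once a TX burst call accepts a packet, the driver owns that buffer and releases it after transmission, so the application should free only the packets that were not accepted. The sketch below illustrates that rule in plain C with stand-ins. `struct pkt`, `fake_tx_burst` and `fake_free` are hypothetical and only mimic the contract of `rte_eth_tx_burst` and `rte_pktmbuf_free`; it is not the application's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for illustration only; not DPDK types or functions. */
struct pkt {
    int owned_by_driver;  /* set once the TX path has accepted the packet */
    int freed_by_app;     /* set when the application frees the packet */
};

/* Mimics the contract of a TX burst call: accepts at most `accept` of the
 * first n packets, takes ownership of them, and returns how many it took. */
static uint16_t fake_tx_burst(struct pkt **pkts, uint16_t n, uint16_t accept)
{
    uint16_t sent = n < accept ? n : accept;

    for (uint16_t i = 0; i < sent; i++)
        pkts[i]->owned_by_driver = 1;
    return sent;
}

/* Mimics an application-side free. Freeing a packet the driver already
 * owns would amount to a double free, so it is rejected here. */
static void fake_free(struct pkt *p)
{
    assert(!p->owned_by_driver);
    p->freed_by_app = 1;
}

/* Send a burst and free only the tail that the driver did not accept. */
static uint16_t send_and_free_unsent(struct pkt **pkts, uint16_t n,
                                     uint16_t accept)
{
    uint16_t sent = fake_tx_burst(pkts, n, accept);

    for (uint16_t i = sent; i < n; i++)
        fake_free(pkts[i]);
    return sent;
}
```

In a real DPDK application the same shape would apply with the value returned by `rte_eth_tx_burst` as `sent`: packets at indexes below `sent` are left alone, and only those from `sent` onward are passed to `rte_pktmbuf_free`.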
Vipin Varghese
  • Thanks a lot for the help @Vipin. It was very nice talking to you. I never thought you would give so much time. Thank you so much for sharing all the knowledge. Couldn't have done it without your help. – nmurshed Oct 23 '21 at 02:54
  • Happy to help and share, best wishes for your project Niyaz – Vipin Varghese Oct 23 '21 at 02:58
  • Even after fixing the code loop, the issue persists somewhat. Instead of the loop, I get the errors below, and then I assume the packet is messed up: `i40e_dev_alarm_handler(): ICR0: malicious programming detected`; `i40e_handle_mdd_event(): Malicious Driver Detection event 0x02 on TX queue 1 PF number 0x01 VF number 0x00 device 0000:18:00.1`; `i40e_handle_mdd_event(): TX driver issue detected on PF` – nmurshed Oct 25 '21 at 14:04
  • @nimurshed, Malicious Driver Detection is the result of sending (TX) an incorrect packet descriptor. It is not caused by RX. – Vipin Varghese Oct 25 '21 at 15:21
  • Okay, so the problem might still be somewhere that messes up the buffer when reading the incoming packets. But one thing is resolved: I now always receive the packets. – nmurshed Oct 25 '21 at 15:48