
I have a DPDK 19 application that reads from a NIC (MT27800 Family [ConnectX-5] 100G) with 32 RX queues and RSS.

So there are 32 processes that receive traffic from the NIC with DPDK. Each process reads from a different queue, copies the data from the mbuf into allocated memory, accumulates 6 MB, and sends it to another thread via a lock-free queue; that other thread only writes the data to disk. As a result, the I/O writes are cached in Linux memory.

All processes run with CPU affinity, and isolcpus is set in GRUB.

Here is a little pseudocode of what happens in each of the 32 processes that reads from its queue. I can't post the real code, it is too much:

MainFunction()
{
   char * local_buf = new...
   size_t offset = 0;
   while(true)
   {
      int nBufs = rte_eth_rx_burst(pi_nPort, pi_nQNumber, m_mbufs, 216);
      for(int i = 0; i < nBufs; i++)
      {
          memcpy(local_buf + offset, GetData(m_mbufs[i]), len); // accumulate into buf
          offset += len;
          if(offset > MAX)
          {
             PushToQueue(local_buf);
             local_buf = new...
             offset = 0;
          }
          rte_pktmbuf_free(m_mbufs[i]);
      }
   }
}

WriterThreadMainFunc()
{
     while(QueueNotEmpty())
     {
          buf = PullFromQueue();
          WriteToDisk(buf);
          delete buf;
     }
}

When the server memory is completely consumed by cache (I know it is still available), I start seeing drops at the NIC.

If I delete the data from disk every minute, the cached memory is released back to free and there are no drops at the NIC. So the drops are clearly linked to the cached data. Until the first drops, the application can run without drops for 2 hours. The processes don't use much memory; each process is at 500 MB.

How can I avoid the drops at the NIC? Here is the output of `free -h`:

               total        used        free      shared  buff/cache   available
Mem:           125G         77G        325M         29M         47G         47G
Swap:          8.0G        256K        8.0G

I use CentOS 7.9, Linux 3.10.0-1160.49.1.el7.x86_64.

yaron
  • please add the code snippet to better understand the application and threading model. It is also not clear whether rx_burst from NIC is followed up with write to data in disk. Please also read how to ask good question too – Vipin Varghese Jan 19 '22 at 14:38
  • @VipinVarghese I updated the question, hope it is clear now. Sorry but i can't put full code here – yaron Jan 19 '22 at 15:08
  • @davidboo the problem described is due to disk content not flushed periodically but held page (4KB) by vfs. This is causing your memory to decrease. DPDK uses huge pages (on x86 2MB and 1GB) I humbly request to fix the writing to disk (this is not DPDK issue). – Vipin Varghese Jan 19 '22 at 16:11
  • @VipinVarghese I didn't understand the proposition, it run 2 hours before the first drops. The available memory is 47 G. What do i need to do different? – yaron Jan 19 '22 at 16:24
  • @davidboo I did not propose anything. I pointed out why your `disk content not flushed periodically but held page (4KB) by vfs` since you are writing to disk, So can you please rephrase your question for me? – Vipin Varghese Jan 19 '22 at 16:30
  • @VipinVarghese How do you know disk content not flush periodically? You wrote "I humbly request to fix the writing to disk". The fact that the cache grow does not mean it is not written to disk. – yaron Jan 19 '22 at 16:36
  • @davidboo `How do you know disk content not flush periodically?` from my experience working with similar problems. `The fact that the cache grow does not mean it is not written to disk` surely this is misunderstanding, you can simply request the flush of the content to Harddisk and cross check `instead of deleting cache file`. – Vipin Varghese Jan 19 '22 at 16:39
  • @VipinVarghese You mean in c to call the fsync function after closing the fd https://linux.die.net/man/2/fsync – yaron Jan 19 '22 at 16:44
  • @davidboo as you have not mentioned what is your code or shared code snippet, I am taking an educated guess you are not flushing at the appropriate intervals. For testing the theory please try using https://stackoverflow.com/questions/9551838/how-to-purge-disk-i-o-caches-on-linux – Vipin Varghese Jan 19 '22 at 17:01

1 Answer


The DPDK API rte_eth_rx_burst uses the mempool (pktmbuf) memory region to hold the metadata and the Ethernet frame. In each rx_burst cycle, internally:

  1. it checks the local cached mempool objects for a pkt_mbuf for the physical NIC to DMA into;
  2. if no locally cached mbufs are found, it acquires the mempool lock and gets the mbufs from the mempool;
  3. all mbufs are marked with ref_cnt set to 1, to indicate the mbuf is in use and not to be freed;
  4. unless tx_burst or rte_pktmbuf_free is invoked, the mbuf is never pushed back to the local cache or the mempool for reuse.
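The lifecycle above can be sketched in a minimal RX loop (DPDK 19.x API; the port/queue numbers and burst size are illustrative):

```c
/* Sketch of the mbuf lifecycle described above (DPDK 19.x).
   Port, queue, and BURST_SIZE are illustrative values. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Each burst takes mbufs out of the mempool (or its local cache). */
        uint16_t n = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < n; i++) {
            /* ... copy rte_pktmbuf_mtod(bufs[i], char *) into app memory ... */

            /* Freeing promptly returns the mbuf to the pool; until this call
               the pool is one mbuf shorter, so a slow consumer drains it. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```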

Hence, as shown in the code snippet, the performance of WriterThreadMainFunc affects the availability of the mempool. That is, if the rate of rx_burst (million packets per second) is greater than that of

  1. the function PullFromQueue,
  2. or the function WriteToDisk,
  3. or both functions,

this leads to the scenario where mbuf_free is slower than rx_burst. To validate this, one can

  1. run dpdk-procinfo for stats and xstats, and check the counter rx_nombuf;
  2. or integrate rte_eth_stats_get and rte_eth_xstats_get into the application for the same counter.
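The second option can be sketched as follows (port number is illustrative; `rx_nombuf` is the field in `struct rte_eth_stats` that counts RX packets dropped because no mbuf was available):

```c
/* Hedged sketch: poll the standard stats API in-process instead of
   running dpdk-procinfo. A rising rx_nombuf confirms the mempool is
   exhausted, i.e. frees are lagging behind rx_burst. */
#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

static void check_rx_nombuf(uint16_t port)
{
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port, &stats) == 0) {
        /* imissed: packets dropped by the HW because queues were full;
           rx_nombuf: RX mbuf allocation failures in the PMD. */
        printf("port %u: imissed=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
               port, stats.imissed, stats.rx_nombuf);
    }
}
```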

Normally, open files (especially in read-write mode) are cached in 4K pages or transparent huge pages (except with the `never` setting) for performance. Based on the conversation in the comments, it looks like that, since caching is in effect, disk I/O runs slower, which in turn causes WriterThreadMainFunc to run slower. To check this behaviour, as suggested in the comments, please

  1. use `echo 1 | sudo tee /proc/sys/vm/drop_caches`;
  2. try using fflush and fsync periodically;
  3. or create a ramdisk and open the file for reading and writing on the ramdisk.

Once the problem is isolated, you can use setbuf(f, NULL) to disable buffering from the start.

Note: there are a multitude of other options for the current requirement too, like creating per port-queue, per flow, or per flow-port-queue files with mmap.

Vipin Varghese
  • I do not write all the packets, and the ones that are passed to the writer thread are copied into RAM before being pushed to the queue. Can dpdk-procinfo be run at the same time as my application? – yaron Jan 27 '22 at 08:25
  • The current code snippet shared does not show whether the packet buffer is copied or not. Also there is no call to tx_burst or mbuf_free, hence I have to assume you are doing the same. With regard to proc_info, yes, you can run it for your application as long as you are not using the `in memory` or primary-only option for eal_init. – Vipin Varghese Jan 27 '22 at 11:35
  • 1. I still don't understand why caching in memory (which is expected behavior of the os) cause drop in the card. Even if cached files are written to disk. And Cached memory is available when a process request it. 2. Even if if do the fflush and fsync on the major part of the file that are written. Some file read/write will be cached into memory and i will end with drop at nic after a 4 hours instead of after a 2 hours – yaron Jan 27 '22 at 15:53
  • As mentioned in the comments and answers mbuf are replenished into mempool when TX or mbuf free is invoked. Moreover the code snippet shared by you is incomplete. I have answered why disk is starts 4k page cache build and how to avoid it. Also asked you to Check the built up rxnombuf counter (to check fail to free mbuf scenario). I am happy to have a chat or debug provided you are ready to debug and share the counter values to help you – Vipin Varghese Jan 28 '22 at 02:23