
This is just a general question relating to some high-performance computing I've been wondering about. A certain low-latency messaging vendor's supporting documentation talks about using raw sockets to transfer data directly from the network device to the user application, and in doing so about reducing messaging latency even further than it already does (through other, admittedly carefully thought-out, design decisions).

My question is therefore to those who grok the networking stacks on Unix or Unix-like systems: how much difference is this method likely to make? Feel free to answer in terms of memory copies, numbers of whales rescued or areas the size of Wales ;)

Their messaging is UDP-based, as I understand it, so there's no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully received!

Best wishes,

Mike

Michael_73
  • This is going to be *very* system specific. You should mention the system(s) you are interested in more specifically. – ndim Apr 21 '10 at 13:31
  • Fair enough, I'm most interested in how this relates to Linux. I've got OSX running on the desktop but can remote to a Linux server to play with test programs. – Michael_73 Apr 21 '10 at 13:37

2 Answers


To reduce latency in high-performance messaging, you should avoid going through a kernel driver at all. The smallest latency is achieved with user-space drivers (MX does this; InfiniBand may as well).

There is a rather good (though slightly outdated) overview of Linux networking internals, "A Map of the Networking Code in Linux Kernel 2.4.20". It includes diagrams of the TCP/UDP datapath.

Using raw sockets will make the path of TCP packets a bit shorter (thanks for the idea). The TCP code in the kernel will not add its latency, but the user must then handle the whole TCP protocol themselves. There is some chance of optimizing it for specific situations: code for clusters doesn't need to handle long-distance or slow links the way the default TCP/UDP stack does.

I'm very interested in this topic too.

osgx
  • The networking internals from 2.4.20 (NAPI) are still there in 2.6, but there are new sendfile(sendpage)/splice interfaces for eliminating copies. – osgx Apr 22 '10 at 03:23
  • It seems like a very interesting topic. I'm also interested in it from the perspective of a Java engineer - to what extent can networking performance (throughput/latency/no GC) be improved by handing this off to a native high-performance networking implementation? Having read that Java was conceived in part as a networking language, I was slightly surprised to read a paper recently that decried the JVM's networking-copying inefficiencies, though this was at least in part in regard to JNI. Perhaps one future direction for the JVM could be to do something special with some of the target OS' networking code. – Michael_73 Apr 22 '10 at 19:52
  • Incidentally, Stevens' book "Unix Network Programming" has a neat way of stopping the OS sending an RST if you're trying to receive TCP packets through a PF_PACKET socket or BPF/pcap/libnet variant. – Michael_73 Apr 22 '10 at 19:59
  • @Michael_73, interesting... can you give a more precise link to this feature? I also don't know, for now, how the OS filters incoming packets to distinguish packets for a RAW socket from other packets. – osgx Apr 22 '10 at 21:28
  • Got lucky on Google books - they've not removed that particular page. See the single indented paragraph at the bottom of page 794 http://books.google.co.uk/books?id=ptSC4LpwGA0C&lpg=PA794&ots=Kq7zLmcjUu&pg=PA794#v=onepage&q&f=false. Otherwise google for "One way around this is to send TCP segments with a source IP address that belongs to the attached subnet" ;) – Michael_73 Apr 23 '10 at 22:41

There are some pictures at http://vger.kernel.org/~davem/tcp_output.html, found by Googling for tcp_transmit_skb(), which is a key part of the TCP datapath. There are some more interesting things on his site http://vger.kernel.org/~davem/

In the user → TCP transmit part of the datapath there is 1 copy from user space into the skb with skb_copy_to_page (when sending via tcp_sendmsg()), and 0 copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment goes undelivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. Sendpage can take a page from another part of the kernel and keep it for the backup (I think there is something COW-like).

Call paths (traced manually via lxr). Sending: tcp_push_one/__tcp_push_pending_frames

tcp_sendmsg() <-  sock_sendmsg <- sock_readv_writev <- sock_writev <- do_readv_writev

tcp_sendpage() <- file_send_actor <- do_sendfile 

Receive: tcp_recv_skb()

tcp_recvmsg() <-  sock_recvmsg <- sock_readv_writev <- sock_readv <- do_readv_writev

tcp_read_sock() <- ... splice read for newer kernels; something sendfile-like for older ones

On receive there can be 1 copy from kernel to user via skb_copy_datagram_iovec (called from tcp_recvmsg). And for tcp_read_sock() there can be a copy: it calls the sk_read_actor callback function. If that corresponds to a file or memory, it may (or may not) copy data out of the DMA zone. If the target is another network, it has the skb of the received packet and can reuse its data in place.

For UDP: receive = 1 copy -- skb_copy_datagram_iovec, called from udp_recvmsg; transmit = 1 copy -- udp_sendmsg -> ip_append_data -> getfrag (seems to be ip_generic_getfrag, with 1 copy from user space, but it could be something sendpage/splice-like without page copying).

Generally speaking, there must be at least 1 copy when sending from / receiving to user space, and 0 copies when using zero-copy (surprise!) with kernel-space source/target buffers for the data. All headers are added without moving the packet, and a DMA-enabled (i.e. any modern) network card will take data from any place in DMA-enabled address space. For ancient cards PIO is needed, so there will be one more copy, from kernel space to PCI/ISA/whatever I/O registers/memory.

UPD: On the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy on receive, from the rx_ring to the skb, and likewise on transmit: from the skb to the tx buffer, +1 copy. You must fill in the IP and TCP headers, but does the skb contain them, or space for them?

osgx
  • "The Performance Analysis of Linux Networking – Packet Receiving" (thanks to http://hackingnasdaq.blogspot.com/2010/01/myth-of-procsysnetipv4tcplowlatency.html - the myth of the tcp_low_latency sysctl) – osgx Apr 22 '10 at 12:45
  • hackingnasdaq.blogspot.com - this blog is very interesting; there are a lot of posts about low-latency Linux networking. – osgx Apr 22 '10 at 13:05
  • **"Potential performance bottleneck in Linux TCP"** is another **very good** (and longer) paper, by W. Wu & M. Crawford, about the Linux network packet path. – osgx Apr 22 '10 at 13:36
  • Wow - superb answer. Too bad you can't mod them up by more than one point... but then I can see where that might lead! Have modded up your other answer too though. Cheers osgx! Spasiba – Michael_73 Apr 22 '10 at 19:56
  • @Michael_73, I hope this will be a part of my thesis :) – osgx Apr 22 '10 at 21:31
  • http://tservice.net.ru/~s0mbre/old/?section=projects&item=recv_zero_copy -- project of zerocopy receive – osgx Apr 23 '10 at 15:57
  • @Michael_73, http://lion.cs.uiuc.edu/courses/cs498hou_spring05/lectures.html good slides for tcp (14-16 lect.) – osgx Apr 24 '10 at 21:36
  • @Michael_73, and the best images in the slides are stolen from book "The Linux® Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel" %) – osgx Apr 24 '10 at 22:52
  • @osgx Thanks very much for the links, mate. Awesome stuff. – Michael_73 Apr 26 '10 at 11:23