
We have a "publisher" application that sends out data using multicast. The application is extremely performance sensitive (we are optimizing at the microsecond level). Applications that listen to this published data can be (and often are) on the same machine as the publishing application.

We recently noticed an interesting phenomenon: the time to do a sendto() increases proportionally to the number of listeners on the machine.

For example, let's say with no listeners the base time for our sendto() call is 5 microseconds. Each additional listener increases the time of the sendto() call by about 2 microseconds. So if we have 10 listeners, now the sendto() call takes 2*10+5 = 25 microseconds.
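For concreteness, here is roughly how we take these measurements (a minimal sketch; the multicast socket setup is omitted and the names are illustrative):

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>

/* Time a single sendto() in microseconds. `fd` is a UDP socket already
   configured for multicast; `dst` holds the (illustrative) group address. */
static long time_sendto_us(int fd, const void *buf, size_t len,
                           const struct sockaddr_in *dst)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sendto(fd, buf, len, 0, (const struct sockaddr *)dst, sizeof *dst);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000L
         + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}
```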

This to me suggests that the sendto() call blocks until the data has been copied to every single listener.

Analysis of the listening side supports this as well. If there are 10 listeners, each listener receives the data two microseconds later than the previous one. (I.e., the first listener gets the data in about five microseconds, and the last listener gets the data in about 23–25 microseconds.)

Is there any way, either at the programmatic level or the system level, to change this behavior? Something like a non-blocking/asynchronous sendto() call? Or at least a call that blocks only until the message is copied into the kernel's memory, so it can return without waiting on all the listeners?

Matt
  • The only behaviour is to block until the memory is copied into kernel-space SKBs. Zero-copy is only possible through the lower-level PF_PACKET interfaces that Wireshark and `tcpdump` use. – Steve-o Jul 29 '11 at 08:46
  • What are the routes on your machine? In particular, do you have a route set up for the multicast group or for the entire 224.0.0.0/4 block? – Foo Bah Jul 29 '11 at 12:59

2 Answers


Sorry for asking the obvious, but is the socket non-blocking? (Add O_NONBLOCK to the socket's flags -- see fcntl.)
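Something like this (a minimal sketch; the descriptor is assumed to be your existing UDP/multicast socket):

```c
#include <fcntl.h>

/* Put an existing socket into non-blocking mode; returns 0 on success,
   -1 on failure (check errno). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```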

Foo Bah
  • Tried that, but it doesn't appear to make any difference. The sending thread still "blocks" until all receivers have fully received the data. – Matt Feb 23 '12 at 22:24

Multicast loopback is incredibly inefficient and shouldn't be used for high-performance messaging. As you noted, for every send the kernel copies the message to every local listener.

The recommended approach is to use a separate IPC method to distribute to other threads and processes on the same host: either shared memory or Unix domain sockets.

For example, this can easily be implemented with ZeroMQ sockets by adding an IPC transport alongside the PGM multicast transport on the same ZeroMQ socket.
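A rough sketch with the libzmq C API (the endpoint strings are illustrative; local subscribers would connect to the `ipc://` endpoint, off-host subscribers receive via PGM):

```c
#include <assert.h>
#include <zmq.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket(ctx, ZMQ_PUB);

    /* One PUB socket, two transports: local listeners subscribe over
       the IPC endpoint; remote listeners receive via PGM multicast. */
    assert(zmq_bind(pub, "ipc:///tmp/feed") == 0);
    assert(zmq_connect(pub, "epgm://eth0;239.192.1.1:5555") == 0);

    zmq_send(pub, "tick", 4, 0);   /* one send, delivered on both transports */

    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}
```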

Steve-o
  • Alternatively, Matt's team could invest time in fixing the kernel's performance with local listeners. – Zan Lynx Jul 29 '11 at 06:01
  • @Zan correct, but I think the options in the kernel are pretty limited compared to zero-copy userspace methods like [LMAX's disruptor](http://code.google.com/p/disruptor/). – Steve-o Jul 29 '11 at 06:07
  • Why would sendto block if the socket is non-blocking? The copy happens once, up into kernel buffers. – Foo Bah Jul 29 '11 at 12:57
  • @Foo the term *blocking* here is a bit of a misnomer; the question refers to the excessive time required to duplicate the sent packet to every receiver on the host. There is no magical thread within the kernel to perform this in the background. – Steve-o Jul 29 '11 at 13:15
  • @Steve-o I've seen but never used ZeroMQ. How do ZeroMQ sockets perform compared to a shared-memory queue/dropbox implementation? Do they lock? – Foo Bah Jul 29 '11 at 13:35
  • @Foo you can easily get millions of messages per second, and it is asynchronous. IPC is, however, over Unix sockets; it could be improved further with shared memory, like 29West's UMS/LBM, if you have large Nehalem applications. Windows is still TCP-socket-only for IPC, awaiting a better implementation. – Steve-o Jul 29 '11 at 13:48
  • I agree, but there's a caveat to shared memory: at least in our experience, you don't get much of a latency improvement if you still rely on the kernel to wake up the receiving thread (e.g., if the reader is based around select()). The lowest-latency approach is a busy-wait scheme, but then you burn a whole CPU. If you have more threads than CPUs, you get into a difficult optimization-with-trade-offs problem. – Matt Feb 23 '12 at 22:10