
I am reading a single data item from a UDP port. It's essential that this read have the lowest possible latency. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card and the callback being invoked in my user code?
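For reference, my receive path looks roughly like this (a minimal sketch; the port number and buffer size are placeholders, and older Boost versions spell `io_context` as `io_service`):

```cpp
#include <array>
#include <iostream>
#include <boost/asio.hpp>

using boost::asio::ip::udp;

int main() {
    boost::asio::io_context io;                              // io_service in older Boost
    udp::socket socket(io, udp::endpoint(udp::v4(), 12345)); // placeholder port

    std::array<char, 2048> buffer;                           // placeholder buffer size
    udp::endpoint sender;

    socket.async_receive_from(
        boost::asio::buffer(buffer), sender,
        [&](const boost::system::error_code& ec, std::size_t bytes) {
            if (!ec)
                std::cout << "received " << bytes << " bytes\n"; // latency-critical work starts here
        });

    io.run();  // returns once the single datagram has been received and the handler has run
}
```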

Boost is a very good library, but it is quite generic; is there a lower-latency alternative?

All opinions on writing low-latency UDP network programs are very welcome.

EDIT: Another question: is there a relatively feasible way to estimate the latency that I'm experiencing between the NIC and user mode?
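To make that concrete, here is the kind of measurement I have in mind, as a Linux-only sketch (the socket setup is assumed; `SO_TIMESTAMPNS` records when the kernel received the packet, so this approximates kernel-to-user latency rather than true NIC-to-user latency, which needs hardware timestamping):

```cpp
#include <sys/socket.h>
#include <time.h>
#include <string.h>

// Assumes the socket was created with:
//   int on = 1;
//   setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));
// Returns the nanoseconds between the kernel receive timestamp and the moment
// recvmsg() handed the datagram to user space, or -1 on error.
long kernel_to_user_ns(int fd) {
    char data[2048];
    char ctrl[CMSG_SPACE(sizeof(struct timespec))];
    struct iovec iov = { data, sizeof(data) };
    struct msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;

    struct timespec user_ts;
    clock_gettime(CLOCK_REALTIME, &user_ts);   // "now", right after the read returned

    for (struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c != nullptr; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS) {
            struct timespec kts;
            memcpy(&kts, CMSG_DATA(c), sizeof(kts));
            return (user_ts.tv_sec - kts.tv_sec) * 1000000000L
                 + (user_ts.tv_nsec - kts.tv_nsec);
        }
    }
    return -1;  // no timestamp attached
}
```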

endian
  • @javapowered Is there anything not contained in the respective man-pages? http://linux.die.net/man/7/epoll and http://linux.die.net/man/2/recvmsg? – sehe Sep 23 '14 at 12:08
  • @sehe I think I should now change my question :) I decided to use normal blocking `recvmsg` because it's easy and probably not bad in a "one thread, one socket" configuration. The accepted answer suggests using `epoll` (good for many sockets, but should we use it for a SINGLE socket?). Will it actually be better than blocking `recvmsg` in a one-thread, one-socket setup? Is it that bad that `recvmsg` blocks? – Oleg Vazhnev Sep 23 '14 at 12:26
  • @javapowered It's all about choices. The fact that it blocks on the actual resource might well give it the lowest latency. But the accepted answer really has tons of relevant information on how to squeeze the last drop of responsiveness out of your OS. – sehe Sep 23 '14 at 12:40

2 Answers


Your latency will vary, but it will be far from the best you can get. Here are a few things that will stand in the way of better latency:

Boost.ASIO

  1. It constantly allocates/deallocates memory to store "state" in order to invoke a callback function associated with your read operation.
  2. It does unnecessary mutex locking/unlocking in order to support a broken mix of async and sync approaches.
  3. Worst of all, it constantly adds and removes event descriptors from the underlying notification mechanism.

All in all, asio is a good library for high-level application developers, but it comes with a big price tag and a lot of CPU-cycle-eating gremlins. Another alternative is libevent; it is a lot better, but it still aims to support many notification mechanisms and remain platform-independent. Nothing can beat native mechanisms, i.e. `epoll`.
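For comparison, an `epoll`-based read loop for a single UDP socket is only a handful of calls. A minimal sketch (assuming Linux; the port number is a placeholder and error handling is omitted):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);                    // placeholder port
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;                             // level-triggered; use EPOLLET for edge-triggered
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);           // registered once, never added/removed again

    for (;;) {
        epoll_event out;
        if (epoll_wait(ep, &out, 1, -1) <= 0)        // block until the socket is readable
            continue;
        char buf[2048];
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        if (len > 0)
            std::printf("got %zd bytes\n", len);     // handle the datagram here
    }
}
```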

Other things

  1. The UDP stack. It doesn't do a very good job for latency-sensitive applications. One of the most popular solutions is OpenOnload: it bypasses the kernel stack and works directly with your NIC.
  2. The scheduler. By default, the scheduler is optimized for throughput, not latency. You will have to tweak and tune your OS to make it latency-oriented. Linux, for example, has a lot of "rt" patches for that purpose.
  3. Watch out not to sleep. Once your process is sleeping, you will never get a good wakeup latency compared with constantly burning a CPU core while waiting for a packet to arrive (see the sketch after this list).
  4. Interference from other IRQs, processes, etc.
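A rough illustration of points 2 and 3 (Linux-specific sketch; the priority value is arbitrary, and `SCHED_FIFO` normally requires root or `CAP_SYS_NICE`): give the receiving thread a real-time priority, then wait for the datagram by spinning on a non-blocking `recv()` instead of sleeping inside a poll call.

```cpp
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>
#include <cerrno>

// Ask the kernel for a real-time FIFO priority so the scheduler does not
// preempt the receive thread in favour of ordinary time-sharing tasks.
bool make_realtime() {
    sched_param sp{};
    sp.sched_priority = 80;                            // arbitrary value in the 1..99 range
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;
}

// Burn the CPU instead of sleeping: retry a non-blocking recv() until a
// datagram arrives, so there is no wakeup latency to pay.
ssize_t spin_recv(int fd, char* buf, size_t len) {
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);  // returns immediately
        if (n >= 0)
            return n;                                  // got the datagram
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                                 // real error
        // nothing yet: spin again without yielding
    }
}
```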

I cannot tell you exact numbers, but assuming that you won't be getting a lot of traffic, and that you are using Boost, a regular Linux kernel, and regular hardware, your latency will range somewhere between ~50 microseconds and ~100 milliseconds. It will improve a bit as you get more data, then start degrading again after some point, and it will always fluctuate. I'd say that if you are OK with those numbers, don't bother optimizing.

Nikolai Fetissov
  • That's all excellent information, thank you very much. The amount of traffic is negligible, I will receive maybe 1-2 packets of 30-50 bytes and that's IT for the lifetime of the process, so I'm very much a corner-case. I am currently on Windows but it seems likely that we should move to Linux in the near term, and I will investigate the RT patches that you've mentioned. – endian Dec 06 '11 at 14:12
  • @endian: You are welcome. Linux will probably do a much better job for you; I have never heard of Windows being used for low-latency work. Also, make sure you optimize the hardware first: it is a lot cheaper to get a Solarflare NIC and a high-end server than to optimize code. Improve the code only once you have reached a reasonable limit on hardware spending. –  Dec 06 '11 at 14:15
  • UDT (http://udt.sourceforge.net/) is probably worth a look, though it is optimized for throughput. But maybe as a comparison it might be interesting... – LiMuBei Dec 06 '11 at 14:24
  • Just to add to Vlad's answer: without hardware acceleration your best bet would probably be a plain `recv()` on a *non-blocking* UDP socket. – Nikolai Fetissov Dec 06 '11 at 14:27
  • Just to add a data point, I am seeing much lower latencies using boost: on the order of 10 mics, INCLUDING my packet processing code. I am on 64-bit linux, using a very high-end server, however. I am getting large latency spikes, though, and more concerningly, UDP buffer overflows. I have the feeling native epoll would clear things up considerably. – Anne Oct 02 '12 at 17:55
  • @xzqx: Lower or higher? There is no way you can get lower. As for epoll, well... it is just an event notification mechanism. But yeah, using edge-triggered mode might help a bit. Though you are always better off without any event notification and interrupts whatsoever, if you can do that. Like... just spinning on some sort of RDMA buffer. –  Oct 02 '12 at 18:02
  • @Vlad: Lower: you gave an estimate of ~50-~100 mics; I said 10 mics in my comment. By "native epoll" I meant using raw epoll/socket calls rather than the Boost wrapper, which does other stuff, again as you mentioned in your answer. – Anne Nov 01 '12 at 19:03
  • So what's better: `epoll`, or a plain `recv()` on a non-blocking socket? Could someone add links to any examples showing how this works? – Oleg Vazhnev Sep 19 '14 at 06:22
  • Also, how slow is `libevent`? Does it make sense to try to implement everything myself, or is it easier to just use `libevent`? – Oleg Vazhnev Sep 19 '14 at 06:47
  • Also, it seems the topic starter needs to process just one socket. Should we really use `epoll` in THIS case? Wouldn't a regular receive (blocking or non-blocking) be better? – Oleg Vazhnev Sep 19 '14 at 07:00

I think that if you use `recv()` in a "spin" loop thread and pin that thread to a single CPU core (processor affinity), the latency should be lower than with `select()`: in my tests the precision of `select()` varied from 1 to 10 microseconds, while the spin loop stayed at about 1 microsecond.
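A rough sketch of that idea (Linux-specific and requires `_GNU_SOURCE`, which g++ defines by default; the core number is arbitrary and the non-blocking socket setup is assumed):

```cpp
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>
#include <cerrno>

// Pin the calling thread to a single core so the spin loop never migrates.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Spin on a non-blocking recv() until a datagram shows up.
ssize_t spin_receive(int fd, char* buf, size_t len) {
    pin_to_core(2);                                    // arbitrary core number
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;                                  // datagram received, or a real error
    }
}
```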

Xavier Lin