
I am doing some benchmarks with an optimized Java NIO selector on Linux over loopback (127.0.0.1).

My test is very simple:

  • One program sends a UDP packet to another program, which echoes it back to the sender, and the round trip time is computed. The next packet is only sent when the previous one is acked (when it returns). A proper warm-up with a couple of million messages is conducted before the benchmark is performed. The message is 13 bytes (not counting UDP headers). The measurement loop is sketched below.
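
For reference, the measurement loop is essentially the following (a simplified sketch: the real test uses a non-blocking channel with a selector, and the port and class name here are illustrative):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class RttClient {
        public static void main(String[] args) throws Exception {
            // The echo server is assumed to be on 127.0.0.1:9999, sending
            // every datagram straight back to its source.
            DatagramChannel ch = DatagramChannel.open();
            ch.connect(new InetSocketAddress("127.0.0.1", 9999));

            ByteBuffer buf = ByteBuffer.allocateDirect(13); // 13-byte message
            long[] rtt = new long[1000000];

            for (int i = 0; i < rtt.length; i++) {
                buf.clear();
                long start = System.nanoTime();
                ch.write(buf);   // send the packet
                buf.clear();
                ch.read(buf);    // blocks until the echo arrives (the real
                                 // test waits on a selector instead)
                rtt[i] = System.nanoTime() - start;
            }
            // min/avg/percentiles are computed from rtt[] after the loop
        }
    }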

For the round trip time I get the following results:

  • Min time: 13 micros
  • Avg time: 19 micros
  • 75th percentile: 18,567 nanos
  • 90th percentile: 18,789 nanos
  • 99th percentile: 19,184 nanos
  • 99.9th percentile: 19,264 nanos
  • 99.99th percentile: 19,310 nanos
  • 99.999th percentile: 19,322 nanos

But the catch here is that I am spinning 1 million messages.

If I spin only 10 messages I get very different results:

  • Min time: 41 micros
  • Avg time: 160 micros
  • 75th percentile: 150,701 nanos
  • 90th percentile: 155,274 nanos
  • 99th percentile: 159,995 nanos
  • 99.9th percentile: 159,995 nanos
  • 99.99th percentile: 159,995 nanos
  • 99.999th percentile: 159,995 nanos

Correct me if I am wrong, but I suspect that once we get the NIO selector spinning, the response times become optimal. However, if we send messages with a large enough interval between them, we pay the price of waking up the selector.

If I play around with sending just a single message I get various times between 150 and 250 micros.

So my questions for the community are:

1 - Is my minimum time of 13 micros, with an average of 19 micros, optimal for this round-trip packet test? It looks like I am beating ZeroMQ by far, so I may be missing something here. From this benchmark it looks like ZeroMQ has a 49-micros average time (99th percentile) on a standard kernel => http://www.zeromq.org/results:rt-tests-v031

2 - Is there anything I can do to improve the selector reaction time when I spin a single message or very few messages? 150 micros does not look good. Or should I assume that in a prod environment the selector will not be quiet?


Edit: By doing busy spinning around selectNow() I am able to get better results (see the sketch after the list below). Sending a few packets is still worse than sending many packets, but I think I am now hitting the selector's performance limit. My results:

  • Sending a single packet I get a consistent 65 micros round trip time.
  • Sending two packets I get around 39 micros round trip time on average.
  • Sending 10 packets I get around 17 micros round trip time on average.
  • Sending 10,000 packets I get around 10,098 nanos round trip time on average.
  • Sending 1 million packets I get 9,977 nanos round trip time on average.
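
The busy-spin loop looks roughly like this (simplified; the measurement and echo logic are omitted, and the port is illustrative):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    public class BusySpinLoop {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false);
            ch.connect(new InetSocketAddress("127.0.0.1", 9999));
            ch.register(selector, SelectionKey.OP_READ);

            ByteBuffer buf = ByteBuffer.allocateDirect(13);
            while (true) {
                // selectNow() returns immediately instead of parking the
                // thread, so we never pay the wake-up cost.
                if (selector.selectNow() > 0) {
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove(); // must remove, or the key stays selected
                        if (key.isReadable()) {
                            buf.clear();
                            ch.read(buf); // the RTT would be recorded here
                        }
                    }
                }
            }
        }
    }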

Conclusions

  • So it looks like the physical barrier for the UDP packet round trip is an average of 10 microseconds, although I got some packets making the trip in 8 micros (min time).

  • With busy spinning (thanks Peter) I was able to go from 200 micros on average to a consistent 65 micros on average for a single packet.

  • Not sure why ZeroMQ is 5 times slower than that. (Edit: Maybe because I am testing this on the same machine through loopback and ZeroMQ is using two different machines?)

Julie
  • I think that much of this is due to HotSpot JVM warmup times rather than the behavior of selectors specifically. – user207421 Aug 24 '12 at 00:05
  • Thanks @EJP, but I did do some warmup with the JVM in -server mode. I sent a couple of million messages before I sent the messages that will trigger the benchmark. Why do you think that is happening => "If I play around with sending just a single message I get various times between 150 and 250 micros." – Julie Aug 24 '12 at 01:16
  • Call me crazy, but why don't you just reimplement your (from the description) short program in C and see the performance? – NoSenseEtAl Aug 24 '12 at 12:40
  • @NoSenseEtAl Call me nuts, but I would love to have a C implementation of a non-blocking selector that my Java program can call through JNI. Any such powerful thing somewhere? – Julie Aug 24 '12 at 13:05
  • @Julie My suggestion was regarding warm/cold performance... You could write simple UDP code in C and run it for 1M and 10 msgs and see if it has the same distribution - if it does, it is probably not a selector warmup problem. Regarding a C implementation - I have no idea, although the wiki suggests it could be done: "A POSIX-compliant operating system, for example, would have direct representations of these concepts, select()." Also you might want to check out the LMAX Disruptor; not just for the Disruptor itself, they have a lot of blogs explaining how to write low latency Java code. – NoSenseEtAl Aug 24 '12 at 13:46
  • @5x: http://www.zeromq.org/results:10gbe-tests-v031 Also like I said check out LMAX Disruptor, AFAIK they have really good latency numbers. – NoSenseEtAl Aug 24 '12 at 14:58
  • @NoSenseEtAl Disruptor is something else. It is message passing between threads. I am more interested in network I/O latencies here. But I think I know why I am much faster than ZeroMQ. Check my edit. – Julie Aug 24 '12 at 15:52
  • I know what Disruptor is, but you are on the same machine. Now I see you care about it being over UDP. BTW if you ever try out zmq_inproc please update your post if you have time. If you don't want to write code you can just try to modify some of the examples from the ZMQ guide. Of course I know inproc uses inter-thread communication; I'm just curious how it compares to your solution and ZMQ tcp – NoSenseEtAl Aug 24 '12 at 16:24
  • @NoSenseEtAl Yes, for low latency it has to be UDP. Also for broadcasting (one-to-many queue) it has to be UDP. I am trying to find ZeroMQ loopback benchmarks. I won't be able to write ZeroMQ code to test that. – Julie Aug 24 '12 at 16:34
  • 0MQ doesn't support UDP :/ Also for TCP I get like under 10k MPS... but it is not comparable. – NoSenseEtAl Aug 24 '12 at 17:03
  • @NoSenseEtAl You are kidding me that ZeroMQ does NOT support UDP? How can you do the one-to-many publisher-subscriber messaging model without UDP broadcast? – Julie Aug 24 '12 at 17:12
  • it supports multicast, but "udp" doesn't exist as an option to connect(). I'm a noob, but read this: http://api.zeromq.org/3-1:zmq-pgm and the answers to SO question 8492377. ALSO note that both pgm and epgm are documented as RELIABLE multicast – NoSenseEtAl Aug 24 '12 at 17:20
  • @NoSenseEtAl You can make UDP reliable at the application level. It does not look like PGM is implemented on top of UDP. UDP broadcast is pretty fast with good switches. Not sure how PGM or ZeroMQ intend to beat that. – Julie Aug 24 '12 at 17:33
  • I know; my point was that their multicast protocols are reliable, so they are likely to suck when it comes to speed (compared to UDP). And why they don't have UDP... IDK... they claim that 0MQ is the best thing ever (a super cool Erlang-like concurrency framework, not just a message queue) so maybe UDP doesn't fit with that. – NoSenseEtAl Aug 24 '12 at 18:43

2 Answers


You often see cases where waking a thread can be very expensive, not just because it takes time for the thread to wake up, but because the thread runs 2-5x slower for tens of microseconds afterwards, as the caches (and likely the branch predictors) are no longer warm.

The way I have avoided this in the past is to busy wait. Unfortunately selectNow creates a new collection every time you call it, even if it is an empty collection. This generates so much garbage it's not worth using.

One way around this is to busy wait on non-blocking sockets, as sketched below. This doesn't scale particularly well, but it can give you the lowest latency, as the thread doesn't need to wake and the code you run afterwards is more likely to be in cache. If you use thread affinity as well, it can reduce disturbance to your thread.
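
A minimal sketch of that idea for the echo side, polling a non-blocking DatagramChannel directly with no Selector at all (the port is illustrative):

    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class BusyWaitEcho {
        public static void main(String[] args) throws Exception {
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false); // receive() returns null instead of blocking
            ch.socket().bind(new InetSocketAddress(9999));

            ByteBuffer buf = ByteBuffer.allocateDirect(64); // reused: no garbage per packet
            while (true) {
                buf.clear();
                SocketAddress from = ch.receive(buf); // poll; null if nothing arrived
                if (from != null) {
                    buf.flip();
                    ch.send(buf, from); // echo straight back
                }
                // The thread never sleeps, so it never pays the wake-up
                // penalty (at the cost of burning one core).
            }
        }
    }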

What I would also suggest is trying to make your code lock-less and garbage-less. If you do this, you can have a process in Java which sends a response to an incoming packet in under 100 microseconds 90% of the time. This would allow you to process each packet on a 100 Mb connection as it arrives (up to 145 microseconds apart due to bandwidth limitations). For a 1 Gb connection you can get pretty close.


If you want fast interprocess communication on the same box in Java, you could consider something like https://github.com/peter-lawrey/Java-Chronicle - this uses shared memory to pass messages with round trip latencies of less than 200 nanoseconds (which is harder to do efficiently with sockets). It also persists the data, and it is useful if you just want a fast way to produce a journal file.
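
This is not the Chronicle API, but the underlying idea can be sketched with a plain memory-mapped file shared between two JVMs (a toy example; the file name is illustrative, and it ignores the memory-ordering details a real implementation has to get right):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemoryWriter {
        public static void main(String[] args) throws Exception {
            // Both processes map the same file.
            RandomAccessFile file = new RandomAccessFile("/tmp/ipc.dat", "rw");
            MappedByteBuffer shared = file.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            for (long seq = 1; seq <= 1000; seq++) {
                shared.putLong(8, System.nanoTime()); // the payload
                shared.putLong(0, seq);               // "publish" by bumping the counter
            }
            // The reader process busy spins on shared.getLong(0) and reads
            // the payload when the counter changes. No system call is made
            // per message, which is why the round trip can be well under a
            // microsecond; proper memory barriers are part of what the
            // library provides.
        }
    }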

Peter Lawrey
  • Hi Peter. Please see my new results based on your comments. Any idea why ZeroMQ is 5 times slower than that? – Julie Aug 24 '12 at 14:25
  • ZeroMQ has to do more than just send a packet on a single socket. It has to do more work, routing etc., so its latency will be higher. I also suspect it uses a background thread to do the sending/receiving, which improves manageability and control over connections (or at least many of these libraries do). One of the trade-offs you often see is that by batching messages using a sending thread, you can increase throughput 10-fold, which is what many libraries focus on rather than latency. – Peter Lawrey Aug 24 '12 at 16:12
  • I suspect the difference is because I am testing this over LOOPBACK. I am trying to find ZeroMQ benchmarks over loopback to compare. A sending thread!? That's terrible! Why can't you just call channel write and let the OS do the rest? For low latency, anything different than NIO is nonsense IMHO. – Julie Aug 24 '12 at 16:32
  • @Julie With NIO, it's worth remembering you can do blocking NIO, and busy waiting for sending/receiving. Using a Selector is not the only option. For testing over a real low latency network I would suggest trying Solarflare, as they have a library which supports kernel bypass from Java without having to use JNI. You can achieve single digit latencies from Java this way. – Peter Lawrey Aug 24 '12 at 18:11
  • Is it possible to implement Kernel Bypass just with code or do I need any specialized hardware or software to do it? Any example somewhere? – Julie Aug 26 '12 at 02:35
  • @Julie In theory you can implement kernel bypass with any device. However, to do this you need to implement your own device driver, which would cost far, far more in development than buying a device which supports this already. There are a number of network adapters which support kernel bypass, and most require you to implement the integration in C. Solarflare supports using it from Java without additional development. – Peter Lawrey Aug 26 '12 at 06:52
  • Hello Peter, does this statement: `Unfortunately selectNow creates a new collection every time you call it even if it is an empty collection` still hold true? I tried looking at the SelectorImpl class's poll and epoll implementations' doSelect() and updateSelectedKeys(); both of them update the internal selectedKeys set using add() – experiment unit 1998X May 19 '23 at 08:55
  • @experimentunit1998X Good question. It is probably JVM-specific, so I would check what it does. – Peter Lawrey May 20 '23 at 18:09
  • I see, which Java version were you referring to when the statement was made? I think I checked either OpenJDK or Oracle – experiment unit 1998X May 21 '23 at 09:03

If you tune your selector right, you can get inter-socket communication in Java in less than 2 microseconds. Here are my one-way results for a 256-byte UDP packet:

Iterations: 1,000,000
Message Size: 256 bytes
Avg Time: 1,680 nanos
Min Time: 1,379 nanos
Max Time: 7,020 nanos
75%: avg=1,618 max=1,782 nanos
90%: avg=1,653 max=1,869 nanos
99%: avg=1,675 max=1,964 nanos
99.9%: avg=1,678 max=2,166 nanos
99.99%: avg=1,679 max=5,094 nanos
99.999%: avg=1,680 max=5,638 nanos

I talk more about Java NIO and the reactor pattern in my article, Inter-socket communication with less than 2 microseconds latency. The core dispatch loop is sketched below.
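
The core of the reactor is just a dispatch loop like the one below (a bare-bones sketch; the Handler interface is illustrative, not code from the article):

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    // Each channel is registered with a Handler as its key attachment; the
    // loop busy spins on selectNow() and dispatches ready keys to handlers.
    interface Handler {
        void handle(SelectionKey key) throws IOException;
    }

    class Reactor implements Runnable {
        private final Selector selector;

        Reactor(Selector selector) {
            this.selector = selector;
        }

        public void run() {
            try {
                for (;;) {
                    if (selector.selectNow() == 0) continue; // spin, never park
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove();
                        ((Handler) key.attachment()).handle(key);
                    }
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }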

TraderJoeChicago