I am doing some benchmarks with an optimized Java NIO selector on Linux over loopback (127.0.0.1).
My test is very simple:
- One program sends a UDP packet to another program, which echoes it back to the sender, and the round-trip time is computed. The next packet is sent only after the previous one has been acked (that is, echoed back). A proper warm-up of a couple of million messages is run before the benchmark. Each message is 13 bytes (not counting UDP headers).
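For anyone who wants to reproduce the shape of this test, here is a minimal sketch of the ping-pong loop. It is not my actual benchmark code: it uses plain blocking `DatagramChannel`s instead of the optimized selector, runs the echo side as a thread in the same JVM, and skips the warm-up. The class and method names (`UdpPingPong`, `measure`) are made up for illustration.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public class UdpPingPong {

    static final int MESSAGE_SIZE = 13; // payload size used in the test above

    /** Sends `count` packets one at a time and returns each round-trip time in nanoseconds. */
    static long[] measure(int count) throws Exception {
        // Port 0 asks the OS for any free port on loopback.
        InetSocketAddress loopback = new InetSocketAddress("127.0.0.1", 0);
        try (DatagramChannel server = DatagramChannel.open().bind(loopback);
             DatagramChannel client = DatagramChannel.open().bind(loopback)) {

            // Echo side: blocking receive, then send the same bytes back to the sender.
            Thread echo = new Thread(() -> {
                ByteBuffer buf = ByteBuffer.allocateDirect(MESSAGE_SIZE);
                try {
                    for (int i = 0; i < count; i++) {
                        buf.clear();
                        SocketAddress from = server.receive(buf);
                        buf.flip();
                        server.send(buf, from);
                    }
                } catch (Exception e) {
                    // Channel closed or I/O error: stop echoing.
                }
            });
            echo.setDaemon(true);
            echo.start();

            client.connect(server.getLocalAddress());
            ByteBuffer out = ByteBuffer.allocateDirect(MESSAGE_SIZE);
            ByteBuffer in = ByteBuffer.allocateDirect(MESSAGE_SIZE);
            long[] rtt = new long[count];
            for (int i = 0; i < count; i++) {
                out.clear(); // position 0, limit 13: send the whole buffer
                in.clear();
                long start = System.nanoTime();
                client.write(out);
                client.read(in); // blocks until the echo arrives (a lost packet would hang here)
                rtt[i] = System.nanoTime() - start;
            }
            echo.join();
            return rtt;
        }
    }
}
```

Calling `UdpPingPong.measure(1_000_000)` returns one sample per message. Since the next packet is only sent after the previous echo has been read, there is never more than one packet in flight.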
For the round trip time I get the following results:
- Min time: 13 micros
- Avg time: 19 micros
- 75% percentile: 18,567 nanos
- 90% percentile: 18,789 nanos
- 99% percentile: 19,184 nanos
- 99.9% percentile: 19,264 nanos
- 99.99% percentile: 19,310 nanos
- 99.999% percentile: 19,322 nanos
But the catch here is that I am spinning 1 million messages.
If I spin only 10 messages I get very different results:
- Min time: 41 micros
- Avg time: 160 micros
- 75% percentile: 150,701 nanos
- 90% percentile: 155,274 nanos
- 99% percentile: 159,995 nanos
- 99.9% percentile: 159,995 nanos
- 99.99% percentile: 159,995 nanos
- 99.999% percentile: 159,995 nanos
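A note on reading the percentiles for the 10-message run: with only 10 samples there is not enough data to distinguish the high percentiles. The sketch below uses the nearest-rank method, which is an assumption on my part for illustration (the `Percentiles` class and `percentile` method are hypothetical names). Under that method, any percentile above the 90th of a 10-sample set resolves to the largest sample, which would explain why the last four lines are identical.

```java
import java.util.Arrays;

public class Percentiles {

    /** Nearest-rank percentile: the smallest sample such that at least pct% of samples are <= it. */
    static long percentile(long[] samples, double pct) {
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For example, with 10 samples `percentile(samples, 99)` gives rank `ceil(9.9) = 10`, that is, the maximum, and so do 99.9, 99.99 and 99.999.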
Correct me if I am wrong, but I suspect that once we get the NIO selector spinning, the response times become optimal. However, if we send messages with a large enough interval between them, we pay the price of waking up the selector.
If I play around with sending just a single message I get various times between 150 and 250 micros.
So my questions for the community are:
1 - Is my minimum time of 13 micros, with an average of 19 micros, optimal for this round-trip packet test? It looks like I am beating ZeroMQ by far, so I may be missing something here. From this benchmark it looks like ZeroMQ has a 49 micros avg time (99% percentile) on a standard kernel => http://www.zeromq.org/results:rt-tests-v031
2 - Is there anything I can do to improve the selector reaction time when I spin a single message or very few messages? 150 micros does not look good. Or should I assume that in a prod environment the selector will not be quiet?
By busy spinning around selectNow() I am able to get better results. Sending a few packets is still worse than sending many packets, but I think I am now hitting the selector performance limit. My results:
- Sending a single packet I get a consistent 65 micros round trip time.
- Sending two packets I get around 39 micros round trip time on average.
- Sending 10 packets I get around 17 micros round trip time on average.
- Sending 10,000 packets I get around 10,098 nanos round trip time on average.
- Sending 1 million packets I get 9,977 nanos round trip time on average.
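To show what I mean by busy spinning around selectNow(), here is a minimal sketch rather than my real code. It makes some simplifying assumptions: both the sender and the echo side are registered on one selector and driven by a single thread, and the names (`BusySpinSelector`, `roundTripNanos`) and the 5-second safety deadline are invented for the example. The key point is that `selectNow()` returns immediately instead of parking the thread, so there is no wake-up cost when a packet arrives.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class BusySpinSelector {

    /** Sends one 13-byte datagram over loopback, echoes it back, and returns the round-trip time in nanoseconds. */
    static long roundTripNanos() throws Exception {
        InetSocketAddress loopback = new InetSocketAddress("127.0.0.1", 0);
        try (Selector selector = Selector.open();
             DatagramChannel server = DatagramChannel.open().bind(loopback);
             DatagramChannel client = DatagramChannel.open().bind(loopback)) {

            server.configureBlocking(false);
            client.configureBlocking(false);
            client.connect(server.getLocalAddress());
            SelectionKey serverKey = server.register(selector, SelectionKey.OP_READ);
            SelectionKey clientKey = client.register(selector, SelectionKey.OP_READ);

            ByteBuffer buf = ByteBuffer.allocateDirect(13);
            long start = System.nanoTime();
            client.write(buf); // send the 13-byte message

            while (true) {
                // Safety net for the sketch: give up after 5 seconds instead of spinning forever.
                if (System.nanoTime() - start > 5_000_000_000L) {
                    throw new IllegalStateException("no echo received");
                }
                // Busy spin: selectNow() never blocks, it just reports what is ready right now.
                if (selector.selectNow() == 0) continue;

                boolean done = false;
                for (SelectionKey key : selector.selectedKeys()) {
                    buf.clear();
                    if (key == serverKey) {
                        // Echo side: bounce the datagram back to whoever sent it.
                        SocketAddress from = server.receive(buf);
                        if (from != null) {
                            buf.flip();
                            server.send(buf, from);
                        }
                    } else if (key == clientKey) {
                        // Sender side: the echo has arrived.
                        if (client.read(buf) > 0) done = true;
                    }
                }
                selector.selectedKeys().clear(); // keys must be cleared manually
                if (done) return System.nanoTime() - start;
            }
        }
    }
}
```

The trade-off is that the spinning thread keeps a core at 100% CPU, so this only makes sense when a core can be dedicated to the selector loop.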
Conclusions
So it looks like the physical barrier for the UDP packet round trip is an average of 10 microseconds, although some packets made the trip in 8 micros (min time).
With busy spinning (thanks Peter) I was able to go from 200 micros on average to a consistent 65 micros on average for a single packet.
Not sure why ZeroMQ is 5 times slower than that. (Edit: Maybe because I am testing this on the same machine through loopback and ZeroMQ is using two different machines?)