I have to transfer large files (up to 10 GB) using UDP. Unfortunately, TCP cannot be used in this case because no bidirectional communication between sender and receiver is possible.
Sending a file is not the problem. I have written the client using Netty. It reads the file, encodes it (unique ID, position in the stream, and so on) and sends it to the destination at a configurable rate (packets per second). All packets arrive at the destination; I have verified that with iptables and Wireshark.
The problem is on the receiving side. Receiving up to 90K packets per second works fine, but receiving and decoding at this rate is not possible with a single thread. My first approach was to use thread-safe queues (one producer and multiple consumers), but adding consumers did not improve the results: some packets were still lost. It seems that the overhead of locking and unlocking the queue slows the process down. So I decided to use the LMAX Disruptor with a single producer (receiving the UDP datagrams) and multiple consumers (decoding the packets). Surprisingly, this did not help either. Using two Disruptor consumers gives hardly any speed advantage, and I wonder why.
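For illustration, the queue-based first approach can be sketched roughly as follows. This is a reconstruction under assumptions, not the original code: the class name `QueueFanOutSketch`, the queue capacity, the dummy 16-byte packets and the poison-pill shutdown are all hypothetical, and the "decoding" is just a counter.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical one-producer / many-consumer setup built on a bounded, lock-based queue. */
final class QueueFanOutSketch {
    // Sentinel that tells a consumer to stop; compared by identity.
    private static final byte[] POISON = new byte[0];

    /** Pushes `packets` dummy datagrams through one queue drained by `consumers` threads. */
    static long run(int packets, int consumers) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        AtomicLong decoded = new AtomicLong();
        Thread[] workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (byte[] p = queue.take(); p != POISON; p = queue.take()) {
                        decoded.incrementAndGet(); // stand-in for the real decoding work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (int i = 0; i < packets; i++) {
                // put() blocks when the queue is full; a real UDP receiver cannot
                // block the sender, so the kernel would drop datagrams instead.
                queue.put(new byte[16]);
            }
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON); // one stop signal per consumer
            }
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return decoded.get();
    }
}
```

Calling `QueueFanOutSketch.run(100_000, 4)` returns the number of packets the four consumers handled. Every `put` and `take` goes through the queue's lock, which is the overhead suspected above.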
This is the main part, which receives the UDP packets and hands them to the Disruptor:
public void receiveUdpStream(DatagramChannel channel) {
    boolean exit = false;
    // the size of the UDP datagram
    int size = shareddata.cr.getDatagramsize();
    // the number of decoders (configurable)
    int nn_decoders = shareddata.cr.getDecoders();
    Udp2flowEventFactory factory = new Udp2flowEventFactory(size);
    // the size of the ring buffer (must be a power of two)
    int bufferSize = 1 << 10;
    Disruptor<Udp2flowEvent> disruptor = new Disruptor<>(
            factory,
            bufferSize,
            DaemonThreadFactory.INSTANCE,
            ProducerType.SINGLE,
            new YieldingWaitStrategy());
    // my consumers
    Udp2flowDecoder decoder[] = new Udp2flowDecoder[nn_decoders];
    for (int i = 0; i < nn_decoders; i++) {
        decoder[i] = new Udp2flowDecoder(i, shareddata);
    }
    disruptor.handleEventsWith(decoder);
    RingBuffer<Udp2flowEvent> ringBuffer = disruptor.getRingBuffer();
    Udp2flowProducer producer = new Udp2flowProducer(ringBuffer);
    disruptor.start();
    while (!exit) {
        try {
            // a new heap buffer is allocated for every datagram
            ByteBuffer buf = ByteBuffer.allocate(size);
            channel.receive(buf);
            receivedDatagrams++; // counting the received packets
            buf.flip();
            producer.onData(buf);
        } catch (Exception e) {
            logger.debug("got exception " + e);
            exit = true;
        }
    }
}
My Disruptor event is simple:
public class Udp2flowEvent {
    ByteBuffer buf;

    Udp2flowEvent(int size) {
        this.buf = ByteBuffer.allocateDirect(size);
    }

    public void set(ByteBuffer buf) {
        this.buf = buf;
    }

    public ByteBuffer getEvent() {
        return this.buf;
    }
}
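Note that `set` replaces the direct buffer allocated in the constructor with whatever buffer the caller passes in, so the preallocated buffer is never used. A hypothetical variant that copies into the preallocated buffer instead could look like the sketch below. `ReusableUdpEvent` and `copyFrom` are illustrative names, not part of my code, and I have not measured whether this changes the throughput.

```java
import java.nio.ByteBuffer;

/** Hypothetical variant of Udp2flowEvent that keeps and reuses its preallocated buffer. */
final class ReusableUdpEvent {
    private final ByteBuffer buf;

    ReusableUdpEvent(int size) {
        this.buf = ByteBuffer.allocateDirect(size);
    }

    /**
     * Copies the readable bytes of src into the preallocated buffer.
     * Throws BufferOverflowException if src holds more bytes than the capacity.
     */
    void copyFrom(ByteBuffer src) {
        buf.clear();   // reset position and limit for writing
        buf.put(src);  // copy src.remaining() bytes
        buf.flip();    // make the copied bytes readable
    }

    ByteBuffer getEvent() {
        return buf;
    }
}
```

With an event like this, the receive loop could reuse a single buffer for `channel.receive` and let the producer copy each datagram into the ring-buffer slot, at the cost of one copy per packet.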
And this is my factory:
public class Udp2flowEventFactory implements EventFactory<Udp2flowEvent> {
    private int size;

    Udp2flowEventFactory(int size) {
        super();
        this.size = size;
    }

    public Udp2flowEvent newInstance() {
        return new Udp2flowEvent(size);
    }
}
The producer:
public class Udp2flowProducer {
    private final RingBuffer<Udp2flowEvent> ringBuffer;

    public Udp2flowProducer(RingBuffer<Udp2flowEvent> ringBuffer) {
        this.ringBuffer = ringBuffer;
    }

    public void onData(ByteBuffer buf) {
        long sequence = ringBuffer.next(); // Grab the next sequence
        try {
            Udp2flowEvent event = ringBuffer.get(sequence);
            event.set(buf);
        } finally {
            ringBuffer.publish(sequence);
        }
    }
}
The interesting but very simple part is the decoder. It looks like this:
public void onEvent(Udp2flowEvent event, long sequence, boolean endOfBatch) {
    // every decoder is called for every event; each one decodes only its own share
    if (sequence % nn_decoders != decoderid) {
        return;
    }
    ByteBuffer buf = event.getEvent();
    event = null; // is it faster to null the event?
    shareddata.increaseReceiveddatagrams();
    // header
    // some code omitted, but the code looks something like this
    final int headertype = buf.getInt();
    final int headerlength = buf.getInt();
    final long payloadlength = buf.getLong();
    // decoding the ints and the long works fine,
    // but decoding the remaining part does not!
    byte[] payload = new byte[buf.remaining()];
    buf.get(payload);
    // some code omitted. The payload is used later on...
}
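To make the payload step concrete: the decoder above allocates a new `byte[]` and copies the remaining bytes for every packet. A minimal sketch of the same header layout (two ints, one long, then the payload) that takes a view of the payload with `ByteBuffer.slice()` instead of copying is shown below. `DecodedPacketSketch` is a hypothetical class, and it is only an assumption that the allocation and copy are what makes this step slow.

```java
import java.nio.ByteBuffer;

/** Hypothetical holder for one decoded datagram; fields mirror the decoder above. */
final class DecodedPacketSketch {
    final int headerType;
    final int headerLength;
    final long payloadLength;
    final ByteBuffer payload; // view onto the datagram's bytes, not a copy

    private DecodedPacketSketch(int headerType, int headerLength,
                                long payloadLength, ByteBuffer payload) {
        this.headerType = headerType;
        this.headerLength = headerLength;
        this.payloadLength = payloadLength;
        this.payload = payload;
    }

    /** Reads the fixed header fields, then exposes the rest of buf as the payload. */
    static DecodedPacketSketch decode(ByteBuffer buf) {
        int headerType = buf.getInt();
        int headerLength = buf.getInt();
        long payloadLength = buf.getLong();
        // slice() shares the bytes between position and limit with buf;
        // no new byte[] is allocated and nothing is copied.
        ByteBuffer payload = buf.slice();
        return new DecodedPacketSketch(headerType, headerLength, payloadLength, payload);
    }
}
```

The trade-off is that the payload view is only valid as long as the underlying buffer is not overwritten, so this does not combine with reusing the receive buffer unless the payload is consumed before the next packet arrives.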
And here are some interesting facts:
- All decoders work. I can see the configured number of decoders running.
- All packets are received, but decoding takes too long. More precisely: decoding the first two ints and the long works fine, but decoding the payload takes too long. This leads to backpressure, and some packets are lost.
- Fun fact: the code works fine on my MacBook Air but not on my server (MacBook: Core i7; server: ESXi with 8 virtual cores on a Xeon @ 2.6 GHz and no load at all).
Now my questions, and I hope that somebody has an idea:
- Why does it make hardly any difference to use several consumers? The difference is only 5%.
- In general: what is the best way to receive 60K (or more) UDP packets per second and decode them? I tried Netty as the receiver, but UDP did not scale very well there.
- Why is decoding the payload so slow?
- Are there any errors that I have overlooked?
- Should I use another producer/consumer library? LMAX has very low latency, but what about throughput?
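One related setting I can check is the socket receive buffer, since datagrams that arrive while the decoders are behind have to wait there. The sketch below requests a larger `SO_RCVBUF` on a `DatagramChannel` and reads back what the operating system reports. `ReceiveBufferSketch` and its method names are hypothetical, the requested size is arbitrary, and the OS may grant a different size than requested.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;

/** Hypothetical helper for inspecting the UDP socket receive buffer size. */
final class ReceiveBufferSketch {

    /** Requests a receive buffer of `bytes` and returns the size the OS reports afterwards. */
    static int requestReceiveBuffer(DatagramChannel channel, int bytes) {
        try {
            channel.setOption(StandardSocketOptions.SO_RCVBUF, bytes);
            return channel.getOption(StandardSocketOptions.SO_RCVBUF);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a throwaway channel to see what the OS grants for a requested size. */
    static int probe(int bytes) {
        try (DatagramChannel channel = DatagramChannel.open()) {
            return requestReceiveBuffer(channel, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `ReceiveBufferSketch.probe(4 * 1024 * 1024)` shows how much of a 4 MiB request the machine actually grants, which may differ between the MacBook and the server.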