I need some help figuring out how to troubleshoot a problem I am witnessing with a high-volume data feed over TCP using .NET sockets.

In a nutshell, when the client application starts, it connects to a specific port on the server. Once connected, the server begins sending real-time data to the client which displays the information in a ticker-like UI. The server supports multiple client workstations, so data will be sent via multiple ports (multiple sockets).

Everything is implemented and working great with a slow, low-volume feed. I am stress testing the system to ensure resilience and scalability. When I increase the frequency, the server runs perfectly, but I am seeing what appear to be lost packets on the clients. This occurs at random times.

Currently, each message that is broadcast is prefaced with a 4-byte value identifying the length of that message. As we are receiving data in the client, we append the data to a buffer (Stream) until we receive that number of bytes. Any additional bytes are considered the start of the next message. Again, this works great until I turn up the frequency.
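To clarify the protocol, the sending side frames each message roughly like this (a simplified sketch for illustration, not the actual server code):

private static void SendMessage(NetworkStream stream, Byte[] payload)
{
    // 4-byte length prefix (via BitConverter), followed by the payload itself
    var lengthPrefix = BitConverter.GetBytes(payload.Length);

    stream.Write(lengthPrefix, 0, lengthPrefix.Length);
    stream.Write(payload, 0, payload.Length);
}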

In my test, I send a message of approx. 225 bytes, followed by one of approx. 310 KB and another of around 40 KB. Sending a message every 1 second works without fail with about 12 clients running. Increasing the frequency to one every 1/2 second, I eventually saw one of the clients' displays freeze. At one every 1/4 second, I can reproduce the problem with as few as 4 clients within a few seconds.

Looking at my code (which I can provide, if needed), I see that all of the clients are receiving data, but somehow the information fell 'out of sync' and the expected length value is enormous (in the 100-million range). As a result, we just keep reading data and never perceive the end of the message.

I either need a better approach or a way to ensure I'm getting the data I expect and not losing packets. Can you help?

UPDATE

I've done a ton of additional testing, varying the size of the messages and delivery frequency. There is definitely a correlation. The smaller I make the message sizes, the higher the frequency I can achieve. But, inevitably, I am always able to break it.

So, to describe more accurately what I am looking for:

  1. To understand what is happening. This will help me identify a possible solution or, at a minimum, establish thresholds for reliable behavior.

  2. To implement a fail-safe mechanism so that when the problem occurs, I can handle it and possibly recover from it. Perhaps adding a checksum into the data stream or something like that; a rough sketch of what I have in mind follows this list.
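For item 2, here is a rough sketch of the kind of guard I have in mind (hypothetical names and values; this is not code I am running):

internal static class FrameGuard
{
    // Arbitrary marker written before every length prefix; if the next 4 bytes
    // read from the stream are not this value, the receiver knows it is out of sync
    public const Int32 MagicMarker = 0x4D534721;

    // Sanity cap on the length prefix (our largest test message is ~310 KB)
    public const Int32 MaxReasonableLength = 1024 * 1024;

    public static Boolean LooksValid(Int32 marker, Int32 length)
    {
        // On failure, resynchronize (e.g. scan forward for the next marker) or reconnect
        return marker == MagicMarker && length >= 0 && length <= MaxReasonableLength;
    }
}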

Here is the code that I am running in the client (receiving) applications:

public void StartListening(SocketAsyncEventArgs e)
{
    e.Completed += SocketReceive;
    socket.ReceiveAsync(e);
}

private void SocketReceive(Object sender, SocketAsyncEventArgs e)
{
    lock (_receiveLock)
    {
        ProcessData(e.Buffer, e.BytesTransferred);

        socket.ReceiveAsync(e);
    }
}

private void ProcessData(Byte[] bytes, Int32 count)
{
    if (_currentBuffer == null)
        _currentBuffer = new ReceiveBuffer();

    var numberOfBytesRead = _currentBuffer.Write(bytes, count);

    if (_currentBuffer.IsComplete)
    {
        // Notify the client that a message has been received (ignore zero-length "keep alive" messages)
        if (_currentBuffer.DataLength > 0)
            NotifyMessageReceived(_currentBuffer);

        _currentBuffer = null;

        // If there are bytes remaining from the original message, recursively process
        var numberOfBytesRemaining = count - numberOfBytesRead;

        if (numberOfBytesRemaining > 0)
        {
            var remainingBytes = new Byte[numberOfBytesRemaining];
            var offset = bytes.Length - numberOfBytesRemaining;

            Array.Copy(bytes, offset, remainingBytes, 0, numberOfBytesRemaining);

            ProcessData(remainingBytes, numberOfBytesRemaining);
        }
    }
}


internal sealed class ReceiveBuffer
{
    public const Int32 LengthBufferSize = sizeof(Int32);

    private MemoryStream _dataBuffer = new MemoryStream();
    private MemoryStream _lengthBuffer = new MemoryStream();

    public Int32 DataLength { get; private set; }

    public Boolean IsComplete
    {
        get { return (RemainingDataBytesToWrite == 0); }
    }

    private Int32 RemainingDataBytesToWrite
    {
        get
        {
            if (DataLength > 0)
                return (DataLength - (Int32)_dataBuffer.Length);

            return 0;
        }
    }

    private Int32 RemainingLengthBytesToWrite
    {
        get { return (LengthBufferSize - (Int32)_lengthBuffer.Length); }
    }

    public Int32 Write(Byte[] bytes, Int32 count)
    {
        var numberOfLengthBytesToWrite = Math.Min(RemainingLengthBytesToWrite, count);

        if (numberOfLengthBytesToWrite > 0)
            WriteToLengthBuffer(bytes, numberOfLengthBytesToWrite);

        var remainingCount = count - numberOfLengthBytesToWrite;

        // If this value is > 0, then we still have more bytes after setting the length, so write them to the data buffer
        var numberOfDataBytesToWrite = Math.Min(RemainingDataBytesToWrite, remainingCount);

        if (numberOfDataBytesToWrite > 0)
            _dataBuffer.Write(bytes, numberOfLengthBytesToWrite, numberOfDataBytesToWrite);

        return numberOfLengthBytesToWrite + numberOfDataBytesToWrite;
    }

    private void WriteToLengthBuffer(Byte[] bytes, Int32 count)
    {
        _lengthBuffer.Write(bytes, 0, count);

        if (RemainingLengthBytesToWrite == 0)
        {
            var length = BitConverter.ToInt32(_lengthBuffer.ToArray(), 0);

            DataLength = length;
        }
    }
}
SonOfPirate
  • the bug is probably in your calculation of expected length. show us the codez – Robert Levy Jan 19 '12 at 21:32
  • 1
    Doubtful since the code works perfectly for over an hour this afternoon with 12 clients and data pumping out once per second. As soon as I ramp-up the frequency, it fails. Same code, same calculation - which is just BitConverter.ToInt32(lengthData). – SonOfPirate Jan 20 '12 at 02:08
  • This code is pretty broken. In more than one place, but most notable is how the RemainingLengthBytesToWrite property returns a negative value. This just falls over when multiple Read() calls are required to get the data. Which happens when the volume increases. – Hans Passant Jan 20 '12 at 17:13
  • Please explain. I have yet to see RemainingLengthBytesToWrite return a negative number. While this is theoretically possible, we never write more than 4 bytes to _lengthBuffer, so this never happens. Unless you have something additional you can add to highlight a problem in this area. – SonOfPirate Jan 20 '12 at 18:51

2 Answers

Without seeing your code, we can only guess. My guess is: Are you considering the case where you read less than the full 4-byte header? You might only read one, two, or three bytes of it. Higher data volumes will cause this to happen more often.
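For illustration, accumulating the header across reads might look something like this (a sketch with illustrative names, not your actual code):

private readonly Byte[] _header = new Byte[4];
private Int32 _headerBytesReceived;

// Returns true once all 4 header bytes have arrived; 'offset' advances past
// however many header bytes this particular read supplied (anywhere from 0 to 4)
private Boolean TryReadLength(Byte[] bytes, ref Int32 offset, Int32 count, out Int32 length)
{
    while (_headerBytesReceived < _header.Length && offset < count)
        _header[_headerBytesReceived++] = bytes[offset++];

    if (_headerBytesReceived == _header.Length)
    {
        length = BitConverter.ToInt32(_header, 0);
        return true;
    }

    length = 0;   // still waiting on more header bytes
    return false;
}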

Since TCP is a reliable protocol, this is not due to packet loss. Any lost packets result in one of two things happening:

  1. The missing data is retransmitted and the receiver experiences a short pause, but never sees missing data or data out of order.
  2. The socket is closed.

UPDATE

Your IsComplete method returns true after only a partial length has been written to the buffer. This causes your receiver code in ProcessData() to discard the length bytes already received, after which the stream is out of sync.
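A completeness check that also accounts for the header would look more like this (a sketch against your ReceiveBuffer):

// Sketch: the buffer is only complete once all 4 length bytes AND all of the
// data bytes have been received. The original check ignores the header, so a
// partial header reads as "complete".
public Boolean IsComplete
{
    get
    {
        return RemainingLengthBytesToWrite == 0
            && RemainingDataBytesToWrite == 0;
    }
}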

Greg Hewgill
  • Yes, I take the minimum of the expected size and actual number of bytes read before populating the buffer. So, if I only have two bytes, I put those in the "length buffer" until I receive two more bytes at which point I can determine how many "data" bytes I need to receive. – SonOfPirate Jan 20 '12 at 02:10
  • I know the socket is not closed because, as I said, I continue to receive data but have somehow fallen out of sequence, so the 4 bytes I use to determine the message size result in an enormous value and I simply keep adding all of the data I receive to the Stream. – SonOfPirate Jan 20 '12 at 02:12
  • Any suggestions how to troubleshoot this problem given the high volume of messages I'm receiving and the fact that I don't know the condition that causes the problem (aside from the frequency)? – SonOfPirate Jan 20 '12 at 02:26
  • 1
    Well, if you are convinced that the receiver is correct, then the problem must lie in your sending code. (The problem is not in TCP or the system networking layer.) When you send data to the socket, are you verifying the the number of bytes it reports as written matches what you requested? Do you handle the case where it doesn't match? – Greg Hewgill Jan 20 '12 at 03:40
  • We are using NetworkStream.Write and passing the output data in one block. According to the documentation, it will send the requested number of bytes or throw an exception. Because no exception is thrown, I can only assume we are sending correctly. That said, I'm not convinced the receiver is right, but I know that we do continue to receive data, so the socket is not closing. – SonOfPirate Jan 20 '12 at 13:20
  • Yes, in theory you can use Wireshark or another packet-sniffing tool to watch the network packets as they go by. This is a good approach when you don't know whether the problem is in the sender or the receiver. Alternately, if your protocol is very simple, you could write a receiver that simply logs data to disk without regard for your block structure, then examine the captured log file. – Greg Hewgill Jan 20 '12 at 19:32
  • @SonOfPirate: Having said that, it looks like the problem is in your receiver code (see my updated answer). Thanks for posting your code. – Greg Hewgill Jan 20 '12 at 19:53
  • I'll take a closer look, but I'm not sure I follow what you are saying. When IsComplete is true, we've received the entire message, notify the app, then clear the buffer and start over. If there are any more bytes in the original array, we recursively call ProcessData to read the rest of the array, treating the next 4 bytes as the length of the next message. Am I missing something? (Familiarity with the code may have me overlooking something obvious.) – SonOfPirate Jan 21 '12 at 16:00
  • Suppose you only receive one byte (the first byte of the four expected for the length header). You'll call `_currentBuffer.Write()` which records the byte in `_lengthBuffer`. But then `ProcessData` calls `_currentBuffer.IsComplete` which *only* looks at `RemainingDataBytesToWrite` to see whether the buffer is complete. Since you don't even know the full length yet, `DataLength` is zero and `IsComplete` returns `true`. But the buffer *isn't* complete yet, and yet `ProcessData` throws away the buffer with the partial length with `_currentBuffer = null`. – Greg Hewgill Jan 21 '12 at 18:50
  • One way to debug problems with this kind of receiver code is to set up a test framework that calls `ProcessData()` with exactly *one* byte at a time. This kind of test would quickly reveal the problem with your receiver; a sketch of such a harness follows this thread. – Greg Hewgill Jan 21 '12 at 18:52
  • All good catches. I've reworked my code and will get the post updated shortly. One thing I've noticed is that the problem tends to appear most reliably when starting a new client after passing whatever the threshold is. Maybe you know what happens if the server starts sending a message immediately after the client connects but the client hasn't called ReceiveAsync on the socket yet? Once the frequency is high enough, this may very well be what is happening. Can I rely on the first packet sent being the first packet received under these conditions? – SonOfPirate Jan 23 '12 at 17:47
  • Nothing bad happens if you haven't called ReceiveAsync in the receiver yet by the time the sender starts sending. There are no bytes lost from the TCP stream in that situation. The explanation is that your code had an error, so it wasn't working. Have you fixed the error? Does your code work now? – Greg Hewgill Jan 24 '12 at 05:29
  • No, still not working. Started putting more unit tests in place but had to redirect my attention to another area for the moment. Hoping that coming at it with fresh eyes may help. – SonOfPirate Jan 26 '12 at 13:09
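For reference, the one-byte-at-a-time harness described in the comments above might look like this (a sketch; it assumes the question's framing and access to the receiver's ProcessData):

public void FeedOneByteAtATime()
{
    Byte[] payload = { 0x01, 0x02, 0x03 };
    var framed = new Byte[sizeof(Int32) + payload.Length];

    // Frame the payload the same way the server does: 4-byte length prefix
    BitConverter.GetBytes(payload.Length).CopyTo(framed, 0);
    payload.CopyTo(framed, sizeof(Int32));

    // Deliver the frame one byte per "read"; a correct receiver should report
    // exactly one complete 3-byte message at the end
    foreach (var b in framed)
        ProcessData(new Byte[] { b }, 1);
}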

I don't know if you have ever heard about network congestion, but it seems to be at least part of your problem. Have a look at the number of calls you are making when data comes in (read: your ProcessData method). It blocks everything else in your receiver while it has control, and it even works recursively.

That means the larger the data you have to process, the longer this method takes to return. In the meantime, you cannot process other incoming messages. So the buffer of your local NIC fills up, any router involved buffers packets for you, and so on. Packets are dropped and resent, and your network is blocked if you can't read fast enough. This is the network congestion mentioned above.

Another thing that jumped right out at me in your code is the lock. Why on earth are you locking anything when you are working asynchronously? You have to understand that your SAEA object is a state object for the thread that is processing your async method call. This is intended to take away the pain of doing the threading yourself. You basically call the socket's ReceiveAsync method and throw an SAEA object at it, and when the receive completes, you take the buffer and information from it and hand the SAEA object back to ReceiveAsync again. Processing takes place in another method that has nothing to do with reading from the socket. This way you keep your socket fast.
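Something along these lines keeps the read loop free of processing (a sketch using BlockingCollection from System.Collections.Concurrent; the names are illustrative, not drop-in code):

private readonly BlockingCollection<Byte[]> _pending = new BlockingCollection<Byte[]>();

private void SocketReceive(Object sender, SocketAsyncEventArgs e)
{
    // Copy the bytes out of the SAEA buffer so it can be reused immediately
    var chunk = new Byte[e.BytesTransferred];
    Array.Copy(e.Buffer, chunk, e.BytesTransferred);

    _pending.Add(chunk);        // hand off to the worker; no lock needed

    socket.ReceiveAsync(e);     // re-post the receive right away
}

private void ProcessLoop()      // run on a dedicated thread or Task
{
    foreach (var chunk in _pending.GetConsumingEnumerable())
        ProcessData(chunk, chunk.Length);
}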

And a last one: I don't know what the purpose of your application is, but using TCP for high volumes of data arriving at high speeds is usually discouraged. That is why the faster network engines in games use UDP; even the slower of those engines usually send around 20 packets a second. If your code breaks at half a second, you should consider switching to UDP. I found Gaffer on Games to be a good source of information about real-time networking with UDP; as he explains the concepts, it might be of use to you as well.
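For comparison, a minimal UDP receive loop is about this simple (a sketch; the port number and HandleMessage handler are placeholders):

using (var udp = new UdpClient(9000))           // port is a placeholder
{
    var remote = new IPEndPoint(IPAddress.Any, 0);

    while (true)
    {
        // One datagram arrives as one complete message: no length prefix or
        // reassembly needed, but datagrams can be lost or reordered
        var datagram = udp.Receive(ref remote);
        HandleMessage(datagram);                // hypothetical handler
    }
}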

HaMster