
I've got a server that uses a two-thread system to manage between 100 and 200 concurrent connections. It uses TCP sockets, since guaranteed delivery is important (it's a communication system where missed remote API calls could FUBAR a client).

I've implemented a custom protocol layer to separate incoming bytes into packets and dispatch them properly (the library is included below). I realize the issues with using MSG_PEEK, but to my knowledge it is the only approach that fulfills the needs of the library implementation. I am open to suggestions, especially if it could be part of the problem.

Basically, the problem is that, at random, the server will drop a client's socket for lack of incoming packets for more than 20 seconds, despite the client successfully sending a keepalive packet every 4 seconds. I can verify that the server itself didn't go offline and that the connections of the users experiencing the problem (including my own) are stable.

The library for sending/receiving is here:

short ncsocket::send(wstring command, wstring data) {
    wstringstream ss;
    // Total packet size in bytes: UTF-16 command + payload, plus the
    // 8-byte length field and two 2-byte '|' delimiters.
    int datalen = ((int)command.length() * 2) + ((int)data.length() * 2) + 12;
    ss << zero_pad_int(datalen) << L"|" << command << L"|" << data;
    int tosend = datalen;
    short __rc = 0;
    do {
        int res = ::send(this->sock, (const char*)ss.str().c_str(), datalen, 0);
        if (res != SOCKET_ERROR)
            tosend -= res;
        else
            return FALSE;
        __rc++;
        Sleep(10);
    } while (tosend != 0 && __rc < 10);
    if (tosend == 0)
        return TRUE;
    return FALSE;
}

short ncsocket::recv(netcommand& nc) {
    vector<wchar_t> buffer(BUFFER_SIZE);
    // Peek so the bytes stay queued until a full packet has arrived.
    int recvd = ::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, MSG_PEEK);
    if (recvd > 0) {
        if (recvd > 8) {
            // First 8 bytes (4 wchar_t) are the zero-padded decimal length.
            wchar_t* lenstr = new wchar_t[5]();   // value-initialized so _wtoi sees a terminator
            memcpy(lenstr, buffer.data(), 8);
            int fulllen = _wtoi(lenstr);
            delete[] lenstr;

            if (fulllen > 0) {
                if (recvd >= fulllen) {
                    // The whole packet is queued; read it for real this time.
                    buffer.resize(fulllen / 2);
                    recvd = ::recv(this->sock, (char*)buffer.data(), fulllen, 0);
                    if (recvd >= fulllen) {
                        buffer.push_back(L'\0');   // terminate for parsing
                        vector<wstring> data = parsewstring(L"|", buffer.data(), 2);
                        if (data.size() == 3) {
                            nc.command = data[1];
                            nc.payload = data[2];
                            return TRUE;
                        }
                        else
                            return FALSE;
                    }
                    else
                        return FALSE;
                }
                else
                    return FALSE;
            }
            else {
                // Unparseable length: drain the junk so we don't spin on it.
                ::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, 0);
                return FALSE;
            }
        }
        else
            return FALSE;
    }
    else
        return FALSE;
}

This is the code for determining if too much time has passed:

if ((int)difftime(time(0), regusrs[i].last_recvd) > SERVER_TIMEOUT) {
    regusrs[i].sock.end();
    regusrs[i].is_valid = FALSE;
    send_to_all(L"removeuser", regusrs[i].server_user_id);

    wstringstream log_entry;
    log_entry << regusrs[i].firstname << L" " << regusrs[i].lastname
              << L" (suid:" << regusrs[i].server_user_id
              << L",p:" << regusrs[i].parent
              << L",pid:" << regusrs[i].parentid
              << L") was disconnected due to idle";
    write_to_log_file(server_log, log_entry.str());
}

The "regusrs[i]" is using the currently iterated member of a vector I use to story socket descriptors and user information. The 'is_valid' check is there to tell if the associated user is an actual user - this is done to prevent the system from having to deallocate the member of the vector - it just returns it to the pool of available slots. No thread access/out-of-range issues that way.

Anyway, I started to wonder if the server itself was the problem. I'm testing on another server currently, but I wanted to see if another set of eyes could spot something out of place or clue me in on a concept with sockets and extended keepalives that I'm not aware of.

Thanks in advance!

  • That first do loop is buggy; if the initial send doesn't take all of the data, you send it all again from the beginning rather than just the data that wasn't sent the first time. Not sure that explains the problem though. How do you send the keepalive packets? Have you examined the data on the wire to ensure they are actually being transmitted in a timely manner? – Harry Johnston Oct 07 '14 at 02:02
  • I haven't examined what's on the wire yet as I can't recreate it reliably enough to do so. I've been connected to the server (since I started hosting it on a different machine) for several consecutive hours. I think a combination of a server change and switching away from MSG_PEEK will help. – Collin Biedenkapp Oct 07 '14 at 04:21
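
For reference, here is a minimal sketch of the partial-send fix Harry Johnston describes: resume after the bytes the kernel actually accepted instead of resending the whole buffer from the start (names taken from the question's send(); untested):

wstring packet = ss.str();                 // keep the string alive while sending
const char* p = (const char*)packet.c_str();
int tosend = datalen;
while (tosend > 0) {
    int res = ::send(this->sock, p + (datalen - tosend), tosend, 0);
    if (res == SOCKET_ERROR)
        return FALSE;
    tosend -= res;                         // advance past what was sent
}
return TRUE;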

1 Answer


I think I see what you're doing with MSG_PEEK, where you wait until it looks like you have enough data to read a full packet. However, I would be suspicious of this. (It's hard to determine the dynamic behaviour of your system just by looking at this small part of the source and not the whole thing.)

To avoid use of MSG_PEEK, follow these two principles (a minimal sketch follows the list):

  1. When you get a notification that data is ready (I assume you're using select), then read all the waiting data from recv(). You may use more than one recv() call, so you can handle the incoming data in pieces.

  2. If you read only a partial packet (length or payload), then save it somewhere for the next time you get a read notification. Put the packets and payloads back together yourself, don't leave them in the socket buffer.
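
Here is roughly what I mean, assuming a Winsock socket and the question's 8-byte (4-wchar_t) decimal length prefix; readbuf and handle_packet are placeholder names, and this is a starting point rather than drop-in code:

// Per-connection accumulation buffer: bytes live here between read
// notifications, not in the socket buffer.
std::vector<char> readbuf;

// Called when select() reports the socket readable.
bool on_readable(SOCKET sock) {
    char chunk[4096];
    int n = ::recv(sock, chunk, sizeof(chunk), 0);
    if (n <= 0)
        return false;                              // connection closed or error
    readbuf.insert(readbuf.end(), chunk, chunk + n);

    // Extract every complete packet currently buffered.
    while (readbuf.size() >= 8) {                  // 8 bytes = the 4-digit length field
        wchar_t lenstr[5] = {0};                   // 4 digits + null terminator
        memcpy(lenstr, readbuf.data(), 8);
        int fulllen = _wtoi(lenstr);
        if (fulllen <= 0)
            return false;                          // malformed header
        if ((int)readbuf.size() < fulllen)
            break;                                 // partial packet: wait for more data
        handle_packet(readbuf.data(), fulllen);    // placeholder dispatcher
        readbuf.erase(readbuf.begin(), readbuf.begin() + fulllen);
    }
    return true;
}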

As an aside, the use of new/memcpy/wtoi/delete is woefully inefficient. You don't need to allocate memory at all, you can use a local variable. And then you don't even need the memcpy at all, just a cast.
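
For instance, something like this avoids both the allocation and the copy (a sketch against the question's recv(), assuming the length field and its trailing L'|' delimiter have already arrived):

// _wtoi stops at the first non-digit, and the length field is followed
// by L'|', so it can parse straight out of the peeked buffer:
int fulllen = _wtoi(buffer.data());

// Or, if a bounded copy is preferred, a local array instead of new/delete:
wchar_t lenstr[5] = {0};               // four digits plus a null terminator
memcpy(lenstr, buffer.data(), 8);      // 8 bytes == 4 wchar_t on Windows
fulllen = _wtoi(lenstr);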

I presume you already assume that your packets can be no longer than 9999 bytes in length (four decimal digits in the length field).

Greg Hewgill
  • I will switch to select() and an internal buffer. However, efficient or not, we use the (UTF-16) string length for compatibility with mobile devices; I don't want to fight with endianness for something as simple as a length param. Do you think the unreliability of MSG_PEEK is why I'm having random disconnects? – Collin Biedenkapp Oct 07 '14 at 04:17
  • My comments about `new` and `memcpy` do not relate to endian-ness. I don't know whether `MSG_PEEK` is considered unreliable, just that I've never used it myself so I'm unfamiliar with its characteristics. – Greg Hewgill Oct 07 '14 at 06:20