I've got a server that uses a two thread system to manage between 100 and 200 concurrent connections. It uses TCP sockets, as packet delivery guarantee is important (it's a communication system where missed remote API calls could FUBAR a client).
I've implemented a custom protocol layer to separate incoming bytes into packets and dispatch them properly (the library is included below). I realize the issues of using MSG_PEEK, but to my knowledge, it is the only system that will fulfill the needs of the library implementation. I am open to suggestions, especially if it could be part of the problem.
Basically, the problem is that, randomly, the server will drop the client's socket due to a lack of incoming packets for more than 20 seconds, despite the client successfully sending a keepalive packet every 4. I can verify that the server itself didn't go offline and that the connection of the users (including myself) experiencing the problem is stable.
The library for sending/receiving is here:
short ncsocket::send(wstring command, wstring data) {
wstringstream ss;
int datalen = ((int)command.length() * 2) + ((int)data.length() * 2) + 12;
ss << zero_pad_int(datalen) << L"|" << command << L"|" << data;
int tosend = datalen;
short __rc = 0;
do{
int res = ::send(this->sock, (const char*)ss.str().c_str(), datalen, NULL);
if (res != SOCKET_ERROR)
tosend -= res;
else
return FALSE;
__rc++;
Sleep(10);
} while (tosend != 0 && __rc < 10);
if (tosend == 0)
return TRUE;
return FALSE;
}
short ncsocket::recv(netcommand& nc) {
vector<wchar_t> buffer(BUFFER_SIZE);
int recvd = ::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, MSG_PEEK);
if (recvd > 0) {
if (recvd > 8) {
wchar_t* lenstr = new wchar_t[4];
memcpy(lenstr, buffer.data(), 8);
int fulllen = _wtoi(lenstr);
delete lenstr;
if (fulllen > 0) {
if (recvd >= fulllen) {
buffer.resize(fulllen / 2);
recvd = ::recv(this->sock, (char*)buffer.data(), fulllen, NULL);
if (recvd >= fulllen) {
buffer.resize(buffer.size() + 2);
buffer.push_back((char)L'\0');
vector<wstring> data = parsewstring(L"|", buffer.data(), 2);
if (data.size() == 3) {
nc.command = data[1];
nc.payload = data[2];
return TRUE;
}
else
return FALSE;
}
else
return FALSE;
}
else
return FALSE;
}
else {
::recv(this->sock, (char*)buffer.data(), BUFFER_SIZE, NULL);
return FALSE;
}
}
else
return FALSE;
}
else
return FALSE;
}
This is the code for determining if too much time has passed:
if ((int)difftime(time(0), regusrs[i].last_recvd) > SERVER_TIMEOUT) {
regusrs[i].sock.end();
regusrs[i].is_valid = FALSE;
send_to_all(L"removeuser", regusrs[i].server_user_id);
wstringstream log_entry;
log_entry << regusrs[i].firstname << L" " << regusrs[i].lastname << L" (suid:" << regusrs[i].server_user_id << L",p:" << regusrs[i].parent << L",pid:" << regusrs[i].parentid << L") was disconnected due to idle";
write_to_log_file(server_log, log_entry.str());
}
The "regusrs[i]" is using the currently iterated member of a vector I use to story socket descriptors and user information. The 'is_valid' check is there to tell if the associated user is an actual user - this is done to prevent the system from having to deallocate the member of the vector - it just returns it to the pool of available slots. No thread access/out-of-range issues that way.
Anyway, I started to wonder if it was the server itself was the problem. I'm testing on another server currently, but I wanted to see if another set of eyes could stop something out of place or cue me in on a concept with sockets and extended keepalives that I'm not aware of.
Thanks in advance!