I am in the unenviable position of having to debug code that was written by someone 10+ years ago who no longer works at the company.
The premise is fairly simple: this is a Windows based test tool that is intended to communicate with an external device that our company builds. The communication is over RS-232 using a Windows COM port via a USB-to-Serial converter. The communication is a simple request/response scheme. The program runs a continuous loop of successive WriteFile() and ReadFile() calls to communicate with the external device. WriteFile to send a command, followed by ReadFile to read the response.
All works well initially, but after some period of time (roughly 10 minutes - although I haven't confirmed that it's always consistent), the ReadFile call stops working - as in, it times out and returns 0 characters every single time after the initial failure. Since I have the ability to debug the external device simultaneously, obviously the first thing I did was to check if the failure was there, but I have confirmed that even after the ReadFile call stops working, the external device still correctly receives the commands sent via the WriteFile call and responds on the same COM port.
// Flush buffer
PurgeComm(hComm, PURGE_RXABORT|PURGE_RXCLEAR|PURGE_TXABORT|PURGE_TXCLEAR);
// Send command
WriteFile(hComm, dataOut_ptr, write_size, &dwBytesWritten, NULL);
//...
// Read Response
ReadFile(hComm, dataIn_ptr, read_size, &dwBytesRead, NULL);
//This sequence works for a while
//At a certain point, the ReadFile call times out and dwBytesRead is 0
//After that point, every call to ReadFile times out in the same way
//WriteFile still works fine and I know that the external device is still responding on the same UART channel
If I close and re-open the COM port after the timeout as shown below, nothing changes.
//This is the code inside the COM close function
PurgeComm(hComm, PURGE_RXABORT);
CloseHandle(hComm);
//...
//This is the COM open code that gets called in a separate function:
hComm = CreateFile( name,
GENERIC_READ | GENERIC_WRITE,
0,
0,
OPEN_EXISTING,
0,
0);
GetCommTimeouts(hComm,&ctmoOld);
ctmoNew.ReadTotalTimeoutConstant = 200;
ctmoNew.ReadTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutMultiplier = 0;
ctmoNew.WriteTotalTimeoutConstant = 0;
SetCommTimeouts(hComm, &ctmoNew);
dcbCommPort.DCBlength = sizeof(DCB);
GetCommState(hComm, &dcbCommPort);
BuildCommDCB("9600,O,8,1", &dcbCommPort);
SetCommState(hComm, &dcbCommPort);
However, if I set a break-point on the external device just before it responds, close the test program and open the COM port in a serial terminal like RealTerm then let the external device proceed, the data comes in fine. At the same time, if I kill and restart the test program entirely, it will also work again for a period of time before again experiencing the same timeout issue.
I have tried playing with the Rx timeout, as well as inserting an additional delay between the WriteFile and ReadFile calls with no success.
I don't get it. Based on this behaviour I don't suspect the Windows USB-to-Serial driver that's being used and feel like there is something going wrong specifically with the use of ReadFile in the test program.
Is there a possibility that the buffer is not being flushed properly and simply stops working because it overflows? Are there known issues with the ReadFile or PurgeComm functions on Windows 10? This is a legacy program that normally runs on a Windows XP machine without issue. I'm having to run it on Windows 10 because I'm using it to test an upgrade of the external device and that's the PC I have.
Edit: To clarify, the "failed" call to ReadFile still returns 1 (so calling GetLastError() is not relevant here), just the number of characters read is 0
Edit 2: Some more details about the communication being attempted...
The Purge-WriteFile-ReadFile sequence alternates between 2 types of commands (same sequence for both commands):
a 'write' command, in which a 134 byte packet (128 byte payload + 6 bytes overhead) is sent to the external device, to which the device responds with a 4-byte 'ok' or 'not ok' handshake
a 'read' command, which is a 6 byte packet with the ID of the data to be read-back (specifically the data that was just written), to which the device responds with a 130 byte (128 bytes data + 2 bytes overhead) response
The timeout always initially occurs during the 'read' command. So the ReadFile call is expecting a length of 130 bytes. After that, the ReadFile call during the 'write' command (where expected bytes read is 4) also times out.