Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it is filled with the same corrupted character. I have looked for anything that could be causing the read to stop early, but found nothing. If I increase the size of the buffer, I see more of the file, so it is not failing on a particular part of the file.

Visual Studio 2019. Windows 10.

#define MAXBUFFERSIZE 1024
DWORD bufferSize = MAXBUFFERSIZE;
_int64 fileRemaining;

HANDLE hFile;
DWORD  dwBytesRead = 0;
//OVERLAPPED ol = { 0 };
LARGE_INTEGER dwPosition;

TCHAR* buffer;

hFile = CreateFile(
    inputFilePath,         // file to open
    GENERIC_READ,          // open for reading
    FILE_SHARE_READ,       // share for reading
    NULL,                  // default security
    OPEN_EXISTING,         // existing file only
    FILE_ATTRIBUTE_NORMAL, // normal file    | FILE_FLAG_OVERLAPPED
    NULL);                 // no attr. template

if (hFile == INVALID_HANDLE_VALUE)
{
    DisplayErrorBox((LPWSTR)L"CreateFile");
    return 0;
}

LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);

_int64 fileSize = (__int64)size.QuadPart;
double gigabytes = fileSize * 9.3132e-10;
sendToReportWindow(L"file size: %lld bytes (%.1f gigabytes)\n", fileSize, gigabytes);

if(fileSize > MAXBUFFERSIZE)
{
    buffer = new TCHAR[MAXBUFFERSIZE];
}
else
{
    buffer = new TCHAR[fileSize];
}
fileRemaining = fileSize;

sendToReportWindow(L"file remaining: %lld bytes\n", fileRemaining);

while (fileRemaining)                                       // outer loop. while file remaining, read file chunk to buffer
{
    sendToReportWindow(L"fileRemaining:%lld\n", fileRemaining);

    if (bufferSize > fileRemaining)                         // as fileremaining gets smaller as file is processed, it eventually is smaller than the buffer
        bufferSize = fileRemaining;

    if (FALSE == ReadFile(hFile, buffer, bufferSize, &dwBytesRead, NULL))
    {
        sendToReportWindow(L"file read failed\n");
        CloseHandle(hFile);
        return 0;
    }

    fileRemaining -= bufferSize;

 // bunch of commented out code (verified that it does not cause the corruption)
}
delete [] buffer;

Debugger HTML view (512-byte buffer): [screenshot: 512 byte buffer]

Debugger HTML view (1024-byte buffer): [screenshot: 1025 byte buffer]. This shows that the file is probably not the source of the corruption.

Misc notes: I have been told that memory mapping the file does not provide an advantage, since I am processing the file sequentially. Another advantage of this method is that when I detect particular recurring tags in the WARC file, I can skip ahead ~500 bytes and resume processing. This improves speed.

kbaud
  • *HTML Visualizer* assumes HTML. Since you aren't feeding it HTML, it falls back to assuming UTF-8. Since you aren't feeding it UTF-8 either, you observe The Apocalypse. How are we supposed to help? – IInspectable Dec 03 '20 at 16:55
  • This is quite common when someone learns only to use pretty little IDEs, with pretty little buttons, dialogs, and widgets, to work with C++ code. These pretty little IDEs, with pretty little buttons, dialogs, and widgets, all either malfunction in mysterious ways, or impose a bunch of intermediate layers of complexity that hide the underlying data and code. If one were using a traditional command-line debugger and inspecting the actual, raw data in memory, as bits and bytes, there would never be any need to wonder whether it is the dialog that is messing up or the data itself. – Sam Varshavchik Dec 03 '20 at 17:02
  • So the rest of my code, which I did not include for clarity, includes `MultiByteToWideChar`, etc. This converts the UTF-8 and displays it in the window. It displays the contents of the file fine up to the point where the buffer is corrupted. Since the corruption is also visible in the debugger view and it persists with or without my later code, I thought my simplified presentation would make it easier for people to understand. You guys both made assumptions about the tools and somehow my score is dinged? It is not just a problem with the debugger. Probably my code? – kbaud Dec 03 '20 at 17:36
  • When dealing with character encoding issues like this, it helps to look at the raw bytes of the file and the raw bytes of the buffer after reading from the file. This could just be an issue with the UI display of the data and not an issue with the data itself. – Remy Lebeau Dec 03 '20 at 17:52
  • In the debugger you should display only the number of bytes actually read. If you display the whole buffer, you will also see bytes left over from a previous read whenever the last operation was shorter. – Stepan Pavlov Dec 03 '20 at 18:57
  • I haven't counted the number of characters in the debugger window but I can tell you that when I use the rest of my code to convert them to what they should be, it works fine until about half way through the buffer. Then I get a bunch of "iiiiiiiii". I was hoping the debug view would be a simpler representation of the problem... – kbaud Dec 03 '20 at 21:13
  • Try to initialize `buffer` when allocating memory for it, like `buffer = new TCHAR[size]{};` – Zeus Dec 04 '20 at 02:30
  • Have you tried using `dwBytesRead` to cut off the garbage? – Stepan Pavlov Dec 04 '20 at 03:03
  • Zhu Song, kudos on an interesting suggestion. I added the braces as you suggested, and for a 1024-byte buffer I only saw 513 bytes in the debug window. It gives a fatal error. Weird effect. Could it be that `TCHAR` is selecting a narrow character? – kbaud Dec 04 '20 at 03:14
  • Yeah, changing `TCHAR` to `wchar_t` produced the same effect, so the Unicode is working as it should. This doesn't explain why the buffer is half the size I requested. – kbaud Dec 04 '20 at 03:18
  • @kbaud I think the reason is that `buffer` is an array of `TCHAR`, and a `TCHAR` is 2 bytes here. `ReadFile` takes its size in bytes, so passing `bufferSize` fills only the first `bufferSize` bytes, while the array actually occupies `sizeof(TCHAR) * bufferSize` bytes. The second half of the array is never written, and that is the "corrupted" half you see. – Zeus Dec 04 '20 at 08:57
  • @ZhuSong-MSFT. That did it! Thank you so much. How do I mark your answer as the answer? I ended up changing "bufferSize" in the ReadFile line to "bufferSize * 2". 2 characters to fix a problem I have spent 2 weeks on! wow. Thank You again. – kbaud Dec 04 '20 at 17:48
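
The comments above point toward two habits: keep the read buffer as plain bytes, and only consume the number of bytes each read actually returned. The following is a minimal portable sketch of that pattern, not the asker's code. It uses `std::istream::read` and `gcount()` as a stand-in for `ReadFile` and `dwBytesRead`, and the function name `readInChunks` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads `in` in chunks of at most `chunkSize` bytes and keeps only the bytes
// actually returned by each read (gcount), the portable analogue of dwBytesRead.
// Assumption: this stands in for a ReadFile loop; it is not the Win32 API.
std::string readInChunks(std::istream& in, std::size_t chunkSize)
{
    assert(chunkSize > 0);

    // A byte buffer: the element count and the byte count are the same number,
    // so the size passed to the read call cannot be off by a factor of two.
    std::vector<char> buffer(chunkSize);
    std::string out;

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // bytes read in this pass
        if (got <= 0) {
            break;
        }
        // Consume only `got` bytes; anything beyond that is stale data
        // left over from an earlier, longer read.
        out.append(buffer.data(), static_cast<std::size_t>(got));
    }
    return out;
}
```

With a byte buffer, any conversion to wide characters (for example with `MultiByteToWideChar`, as mentioned in the comments) can happen afterwards on exactly the bytes that were read.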

1 Answer

The reason is that `buffer` is an array of type `TCHAR`, and a `TCHAR` is 2 bytes in a Unicode build. `ReadFile` counts in bytes, so the `bufferSize` passed to it fills only the first `bufferSize` bytes of the array.

The array itself occupies `sizeof(TCHAR) * bufferSize` bytes, so its second half is never written. That untouched half is what appears "corrupted".
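
The arithmetic can be checked with a small portable sketch. It is an illustration under stated assumptions, not the Win32 call itself: `char16_t` stands in for a 2-byte `TCHAR`, `std::memcpy` stands in for `ReadFile` writing raw bytes, and the names `WideUnit`, `readBytesIntoWideBuffer`, and `countUntouched` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: a 2-byte character type standing in for TCHAR in a Unicode build.
using WideUnit = char16_t;

// Allocates `elementCount` 2-byte elements pre-filled with `fill`, then copies
// `byteCount` raw bytes into the start of the buffer, the way a byte-oriented
// read would. memcpy is a stand-in for ReadFile here.
std::vector<WideUnit> readBytesIntoWideBuffer(const std::vector<unsigned char>& fileBytes,
                                              std::size_t elementCount,
                                              std::size_t byteCount,
                                              WideUnit fill)
{
    assert(byteCount <= fileBytes.size());
    assert(byteCount <= elementCount * sizeof(WideUnit));

    std::vector<WideUnit> buffer(elementCount, fill);
    if (byteCount > 0) {
        std::memcpy(buffer.data(), fileBytes.data(), byteCount);
    }
    return buffer;
}

// Counts the elements that still hold the fill pattern, i.e. were never written.
std::size_t countUntouched(const std::vector<WideUnit>& buffer, WideUnit fill)
{
    return static_cast<std::size_t>(std::count(buffer.begin(), buffer.end(), fill));
}
```

Calling `readBytesIntoWideBuffer` with 1024 elements and a byte count of 1024 gives a 2048-byte buffer in which only the first 512 elements are overwritten, and `countUntouched` reports the remaining 512. Either sizing the read in bytes (`elementCount * sizeof(WideUnit)`) or switching to a byte buffer removes the mismatch.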

Zeus