
I was trying to make a very basic text editor with Win32 that has the ability to read files and change the text of an edit control to it. I want it to be able to handle chars in all languages, so I tried to use a LPWSTR for the second parameter of ReadFile(), like this:

HANDLE file = CreateFile(_T("D:\\C++ Stuff\\Testing.txt"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
DWORD fileSize = GetFileSize(file, NULL);
LPWSTR buffer = (LPWSTR)GlobalAlloc(GPTR, fileSize + 1);
DWORD read;
ReadFile(file, buffer, fileSize, &read, NULL);
MessageBox(NULL, buffer, NULL, NULL);
GlobalFree(buffer);

But the MessageBox shows up with a bunch of gibberish! If I use debug mode and add a watch to buffer, it's still the same. It makes no difference whether the file being opened contains UTF-16 encoded chars or not. Is this normal? If yes, is there an alternative way to read the file into an LPWSTR? If no, how do I fix it? I'm using Visual Studio 2015 for this project.

P.S. The code provided is only an example. In the actual code, I check whether CreateFile(), GetFileSize(), GlobalAlloc() and ReadFile() failed, and I null-terminate buffer.

Tyler Tian
  • One obvious issue is that your buffer probably isn't null-terminated. Otherwise it's impossible to know what the problem could be without knowing what the data in the file consists of. You're also doing no error checking - so for instance, the `CreateFile` may actually be failing. – Jonathan Potter Mar 26 '16 at 01:18
  • @JonathanPotter This is only an example... In the actual program I have error checking and null-termination and everything. – Tyler Tian Mar 26 '16 at 01:23
  • Don't show examples, show real code. – Jonathan Potter Mar 26 '16 at 01:24

1 Answer


If the text file is in ASCII/UTF-8, then reading it as raw bytes into a wide-character buffer (LPWSTR) will result in very odd garbage, because e.g. the characters ABCD (ASCII/UTF-8 bytes 0x41, 0x42, 0x43, 0x44) would instead be interpreted as the two wide-character code units 0x4241 and 0x4443 on a little-endian machine.
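
To see why, here is a small portable sketch (not Win32-specific) that reproduces the effect: copying the raw bytes of "ABCD" into 16-bit code units, which is effectively what ReadFile() does when handed a wchar_t buffer (misreadAsWide is a hypothetical name):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret four raw ASCII bytes as two 16-bit code units -- the same
// thing that happens when ReadFile() dumps raw bytes into an LPWSTR buffer.
// There is no widening of each byte, just pairing of adjacent bytes.
void misreadAsWide(const unsigned char bytes[4], std::uint16_t units[2]) {
    std::memcpy(units, bytes, 4);
}
```

On a little-endian machine the two units come out as 0x4241 and 0x4443, neither of which is 'A', 'B', 'C' or 'D'.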

Check whether your text file is ASCII/UTF-8 or wide character (UTF-16), and note that Windows tools typically prepend a two-byte byte-order mark (BOM, the bytes 0xFF 0xFE for UTF-16LE) that tools on other platforms often don't expect, so even if your text file is wide character, you may see a stray character at the start from the BOM.
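
A minimal check for that UTF-16LE BOM at the start of the raw bytes might look like this (HasUtf16LeBom is a hypothetical helper name):

```cpp
#include <cstddef>

// Returns true if the buffer starts with the UTF-16LE byte-order mark,
// which Windows editors such as Notepad write as the two bytes 0xFF 0xFE.
bool HasUtf16LeBom(const unsigned char *bytes, std::size_t size) {
    return size >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE;
}
```

If the BOM is present, the rest of the file is very likely UTF-16LE and can be copied straight into a wchar_t buffer (skipping the first two bytes); otherwise treat it as ASCII/UTF-8 and convert.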

If you need unicode, and cannot change your project to use ANSI (LPSTR), then you can either read into a byte array and then convert using the Win32 function MultiByteToWideChar (a plain API call, not COM), or you can read each byte and type-cast it to wchar_t, then store it in your wide-character buffer,

// byte_buffer holds the raw bytes read by ReadFile()
for (int position = 0; position < fileSize; position++)
    buffer[position] = (wchar_t)byte_buffer[position];

or equivalent.
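
A fuller sketch of the MultiByteToWideChar route, assuming the file is UTF-8 (Windows-only and illustrative only; ReadFileAsWide is a hypothetical helper, and real code should check every call as the question's P.S. notes):

```cpp
#include <windows.h>

// Read a file as raw bytes, then convert UTF-8 -> UTF-16 with
// MultiByteToWideChar(). Caller frees the result with GlobalFree().
LPWSTR ReadFileAsWide(LPCWSTR path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;

    DWORD size = GetFileSize(file, NULL);
    char *bytes = (char *)GlobalAlloc(GPTR, size);
    DWORD read = 0;
    BOOL ok = bytes && ReadFile(file, bytes, size, &read, NULL);
    CloseHandle(file);
    if (!ok) { GlobalFree(bytes); return NULL; }

    // First call asks how many wchar_t's the converted text needs.
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, bytes, (int)read, NULL, 0);
    LPWSTR wide = (LPWSTR)GlobalAlloc(GPTR, (wideLen + 1) * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, bytes, (int)read, wide, wideLen);
    wide[wideLen] = L'\0';  // null-terminate for MessageBox etc.

    GlobalFree(bytes);
    return wide;
}
```

Because ASCII is a subset of UTF-8, this handles plain-ASCII files too, and it converts multi-byte sequences (such as Chinese characters) correctly, which the per-byte cast loop cannot.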

Matt Jordan
  • Thanks, it works, but if I have chars that are represented by two bytes, such as a Chinese symbol, it won't display correctly. I know why this is happening, but is there any way to get `ReadFile()` to read out wide chars in the first place? I'm not sure how Windows stores Chinese chars; it seems like regular ASCII chars are one byte and others are two bytes, so the text document is part UTF-8, part UTF-16... – Tyler Tian Mar 26 '16 at 01:47
  • Windows itself stores all characters as UTF-16/wchar_t internally, but most applications don't care, and often store in UTF-8 (superset of ASCII). MultiByteToWideChar should work with that, but note that wchar_t is 15 years old (Windows 2000, I believe) - it was enough to represent all unicode then, but may not be enough now. Usually the application that saves it will tell you in the save dialog; if it is non-interactive, then it may be a UTF-32 unicode encoding that cannot be represented in UTF-16/wchar_t. How was the .txt file generated? – Matt Jordan Mar 26 '16 at 01:56
  • I used Windows Notepad to do that. – Tyler Tian Mar 26 '16 at 02:33
  • @MattJordan: "*note that wchar_t is 15 years old (Windows 2000, I believe)*" - you are thinking of UCS-2. UTF-16 replaced UCS-2 while preserving backwards compatibility with UCS-2. Microsoft switched from UCS-2 to UTF-16 in Windows 2000. The 16bit `wchar_t` data type is used for both encodings, and is still widely used in Windows programming and happily handles UTF-16 just fine. Unicode codepoints above U+FFFF are handled in UTF-16 using surrogate pairs. All UTF-X encodings (UTF-7, UTF-8, UTF-16, UTF-32) can represent the *entire* Unicode repertoire of codepoints. – Remy Lebeau Mar 26 '16 at 02:38
  • @Remy Lebeau No, I'm referring to the wide-char character set, which only supports about 65,000 characters, and actually significantly less because of unused-but-allocated ranges. The code page allocation probably could have been packed better, but back then, it seemed like plenty, since most people used the Latin character set anyway. For a while, Microsoft was beginning to switch to UTF-32, but now looks to be heading toward UTF-8 instead, as almost every other OS uses UTF-8. That will take a long time, and could even change again. – Matt Jordan Mar 26 '16 at 02:45
  • @MattJordan: your code example to loop through the raw file's bytes, type-casting each individual byte to `wchar_t`, is the **worst** way to handle this situation, and will only work correctly for files that are encoded in ASCII or Latin-1/ISO-8859-1, where the byte values and the corresponding codepoint values are the same values. It will not work for other charsets. `MultiByteToWideChar()` is the correct solution. – Remy Lebeau Mar 26 '16 at 02:45
  • @MattJordan: "*I'm referring to the wide-char character set*" - that is formally known as UCS-2. What you said about "*65,000 characters*" is true for UCS-2, but not for UTF-16. Windows native Unicode encoding is UTF-16. Microsoft never leaned towards UTF-32, and will never switch to UTF-8. And yes, while the `wchar_t` data type itself is only 16 bits on Windows (32 bits on other platforms) and thus can only represent numeric values 0-65535, *two* `wchar_t` values acting together comprise a *surrogate pair* in UTF-16, thus supporting encoding Unicode codepoints higher than 65535 in value. – Remy Lebeau Mar 26 '16 at 02:48
  • Notepad is a unique case, since it doesn't understand encodings very well, but can preserve the encoded characters. The Chinese characters are probably valid in UTF-16, since Windows is UTF-16 internally, but Notepad may either be mixing UTF-16 and ASCII, or it may be encoding using its own choice of code page, which could make a mess of the resulting file. The first thing I would suggest is to load it into something that is a bit more strict, and make sure the encoding is valid - maybe MS Word, although I don't know if it enforces .TXT encodings. – Matt Jordan Mar 26 '16 at 02:48
  • [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – Remy Lebeau Mar 26 '16 at 02:51
  • @Remy Lebeau I realize the loop is a bad way, which is the reason I mentioned MultiByteToWideChar first. However, some people are not comfortable with using COM, and I didn't know the text document included Chinese characters at the time, so I included it as a second option, but not the first option. Note that I am aware of the surrogate pair concept, but I am also aware that most software doesn't consistently support that, effectively reducing it to UCS-2. – Matt Jordan Mar 26 '16 at 02:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107386/discussion-between-remy-lebeau-and-matt-jordan). – Remy Lebeau Mar 26 '16 at 03:02
  • A great deal of misinformation here from Matt – David Heffernan Mar 26 '16 at 05:48
  • @MattJordan: MultiByteToWideChar is not a COM function!!! – user2120666 Mar 26 '16 at 14:12