
I'm reading a file of thousands of non-English strings, many of them East Asian, using fgets, and then calling MultiByteToWideChar to convert each one to a wide (UTF-16) string:

WCHAR wstr[BUFSIZ] = { '\0' };
int result = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, src, -1, wstr, BUFSIZ);

This approach is working fine in nearly every case. The two strings for which it isn't working are:

我爱你  (read in by fgets as "我爱ä½")
コム    (read in by fgets as "コãƒ")

In both cases, the call to MultiByteToWideChar returns zero, and the final character of wstr is garbage:

我爱�  (final character \xE4\xBD)
コ�    (final character \xE3\x83)
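
For reference, the full read-and-convert loop looks roughly like this (the file name and the error handling are simplified stand-ins, not my actual code):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    FILE *fp = fopen("strings.txt", "r");   /* hypothetical UTF-8 input file */
    if (fp == NULL) return 1;

    char src[BUFSIZ];
    while (fgets(src, sizeof src, fp) != NULL)
    {
        WCHAR wstr[BUFSIZ] = { '\0' };
        int result = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                         src, -1, wstr, BUFSIZ);
        if (result == 0)
        {
            /* With MB_ERR_INVALID_CHARS, a truncated or otherwise invalid
               UTF-8 sequence makes the call fail with
               ERROR_NO_UNICODE_TRANSLATION. */
            fprintf(stderr, "conversion failed for: %s\n", src);
        }
    }

    fclose(fp);
    return 0;
}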

Is there some environmental set-up, or alternative manner of reading my text file, that would eliminate this problem?

MiloDC
  • What is `BUFSIZ`? – Paul Sanders Dec 24 '20 at 12:42
  • Is your source saved as unicode? – Michael Chourdakis Dec 24 '20 at 12:45
  • BUFSIZ is 512. The source file is UTF-8 encoded. – MiloDC Dec 24 '20 at 13:14
  • What do you get if you don't specify `MB_ERR_INVALID_CHARS`? – Paul Sanders Dec 24 '20 at 13:40
  • Looks like your input is corrupted and you lost the last byte. Length should be a multiple of 3, but they are 8 and 5 instead of 9 and 6. My guess is that you tried to remove the \n, but these strings didn't end in \n, so you removed a payload byte by mistake. – Raymond Chen Dec 24 '20 at 14:02
  • @Paul: If I don't specify `MB_ERR_INVALID_CHARS`, nothing changes. I still get the garbage character at the end. – MiloDC Dec 24 '20 at 17:13
  • @Raymond: Nowhere in code am I chopping off newline characters. In fact, the strings are read by `fgets` as having 8 and 5 characters (they're read into a buffer that is `BUFSIZ`, i.e. 512, bytes long). This is why I'm wondering whether there's some other convention for reading the file. – MiloDC Dec 24 '20 at 17:16
  • Well, **somebody** is deleting them. Curiously, the missing byte is `0xa0` in both cases. – Raymond Chen Dec 24 '20 at 17:21
  • That would seem to be the case, Raymond, yes. Interestingly, I don't have this problem of lost information with any of the thousands of other East Asian strings that I'm reading. – MiloDC Dec 24 '20 at 17:23

1 Answer


I found the problem, thanks to Raymond Chen's observation that the number of bytes in the source string was incorrect for 我爱你 and コム.

The code that I'm debugging trims trailing whitespace from the strings read by fgets, and that's what was corrupting these two: the UTF-8 encodings of 我爱你 and コム both end in the byte 0xA0, which the non-UTF-8-aware trim treats as whitespace (it's a non-breaking space) and strips.
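
One way to avoid this is to strip only ASCII whitespace, so bytes at or above 0x80 are never touched. A minimal sketch of that kind of trim (the function name is mine, not the code I'm debugging):

#include <string.h>

/* Strip trailing ASCII whitespace only; bytes >= 0x80, including the 0xA0
   that ends the UTF-8 encodings of 你 (E4 BD A0) and ム (E3 83 A0), are
   left untouched. */
static void rtrim_ascii(char *s)
{
    size_t len = strlen(s);
    while (len > 0)
    {
        unsigned char c = (unsigned char)s[len - 1];
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
            s[--len] = '\0';   /* plain ASCII whitespace: safe to cut */
        else
            break;             /* anything else stays, including UTF-8 continuation bytes */
    }
}

The key point is to compare against an explicit ASCII whitespace set rather than call isspace(), whose result for bytes at or above 0x80 depends on the locale (and which is undefined for negative char values).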

MiloDC
  • I have no idea how Raymond was able to determine the number of bytes in your string from the information provided, but this is why, for questions about encoding-transformation functions, you must always provide a hex dump of the input and output. – Ben Voigt Dec 24 '20 at 18:02
  • Presumably, he looked at the strings that I incorrectly claimed were read in by `fgets`. – MiloDC Dec 24 '20 at 18:40
  • `\xa0` is a non-breaking space, which explains why a UTF-8 sequence ending in it gets stripped by a non-UTF-8-aware whitespace-stripping function. – Mark Tolonen Dec 25 '20 at 21:01
  • @BenVoigt I'm sure Raymond just encoded the expected values to UTF-8 to see what the bytes were supposed to be, e.g. `我爱你 => "\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0"` and `コム => "\xe3\x82\xb3\xe3\x83\xa0"` – Mark Tolonen Dec 25 '20 at 21:05
  • @MiloDC: I have very little faith that the characters would be perfectly preserved in a Stack Overflow question, you've got the web browser posting the question, storage and retrieval to some database, and then the web server and browser viewing it again. – Ben Voigt Dec 26 '20 at 03:48