I need to read a text file that contains strings in arbitrary MBCS encodings. The format of the file (simplified) is like this:
CODEPAGE "STRING"
CODEPAGE STRING
...
where CODEPAGE can be any MBCS codepage: UTF-8, cp1251 (Cyrillic), cp932 (Japanese), etc.
I can't decode the whole file in one call to MultiByteToWideChar. I need to extract each string (the text between quotes, or up to the next space or carriage return) and call MultiByteToWideChar on the extracted string with its own codepage.
But in an MBCS (multi-byte character set) one character can be represented by more than one byte. If I want to find a Latin 'A' in a multi-byte encoded file, I can't just search for the byte value 65, because 65 can be the trailing byte of some multi-byte sequence.
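A minimal illustration of that false match, sketched in Python only because its built-in cp932 codec makes it easy to try (the real code would go through MultiByteToWideChar). The helper name naive_find is hypothetical, and the sketch assumes Python's cp932 table agrees with the Windows one for this character:

```python
# "ア" (KATAKANA LETTER A) is encoded in cp932 as the two bytes 0x83 0x41.
encoded = "ア".encode("cp932")


def naive_find(data: bytes, ch: str) -> int:
    """Byte-wise search that ignores character boundaries (hypothetical helper)."""
    return data.find(ch.encode("ascii"))


# The search reports a hit at offset 1, although the text contains no Latin 'A':
# the byte 0x41 is only the trailing byte of the katakana character.
false_hit = naive_find(encoded, "A")
decoded = encoded.decode("cp932")

print(encoded, false_hit, "A" in decoded)
```

The same kind of false hit is what I am worried about for '"', space and CR.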
So I'm not sure whether I'm allowed to search for '"', space or CR in an MBCS string. I browsed several codepages (for example the Chinese 936 codepage: https://ssl.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL) and as far as I can see all trailing bytes start at 0x40 or above, so it looks safe to scan the file for delimiters below 0x40 such as '"', space and CR. But is there a guarantee of that for every codepage?
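The manual browsing can be repeated mechanically for a few codepages. The sketch below is only an empirical check, not the guarantee I am asking about: it uses Python's own codec tables (an assumption, since they may differ slightly from the Windows ones), covers only four double-byte codepages, and the names DELIMITERS, CODECS and trail_byte_hits are hypothetical.

```python
# Delimiters of interest: '"', space, CR, LF.
DELIMITERS = (0x22, 0x20, 0x0D, 0x0A)
# A handful of double-byte codepages available as Python codecs.
CODECS = ("cp932", "cp936", "cp949", "cp950")


def trail_byte_hits(codec: str, trail: int) -> list:
    """Return the lead bytes for which (lead, trail) decodes to ONE character,
    i.e. the cases where `trail` really is consumed as a trailing byte."""
    hits = []
    for lead in range(0x80, 0x100):
        try:
            text = bytes([lead, trail]).decode(codec)
        except UnicodeDecodeError:
            continue  # not a valid sequence in this codec
        if len(text) == 1:  # two bytes produced a single character
            hits.append(lead)
    return hits


for codec in CODECS:
    for delim in DELIMITERS:
        print(codec, hex(delim), trail_byte_hits(codec, delim))
```

For comparison, trail_byte_hits("cp932", 0x41) is not empty (it includes the lead byte 0x83), which matches the 'A' problem described above.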