Is it guaranteed that trailing bytes in MBCS encodings are in a specific range?

Question

I need to read a text file that contains strings in arbitrary MBCS encodings. The file format (simplified) looks like this:

CODEPAGE "STRING"
CODEPAGE STRING
...

where CODEPAGE can be any MBCS codepage: UTF-8, cp1251 (Cyrillic), cp932 (Japanese), etc.

I can't decode the whole file in one call to MultiByteToWideChar. I need to extract the string between quotes, or up to the next space or carriage return, and call MultiByteToWideChar on the extracted string.

But in an MBCS (multi-byte character set) one character can be represented by more than one byte. If I want to find the Latin 'A' in a multi-byte encoded file, I can't just search for code 65, because 65 can be a trailing byte in some encoded sequence.
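A concrete case of this problem, as a small Python sketch. Python's built-in `cp932` codec stands in for the Windows code page here, which is an assumption: the two may differ in a few mappings.

```python
# The katakana letter U+30A2 is a two-byte sequence in cp932 (Shift-JIS family).
encoded = "\u30a2".encode("cp932")

lead, trail = encoded[0], encoded[1]
print(hex(lead), hex(trail))

# The trail byte has the same value as ASCII 'A' (65), so a naive byte
# search for 'A' reports a match inside a character that is not 'A'.
naive_hit = encoded.find(b"A")
print(naive_hit)
```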

So I'm not sure whether I'm allowed to search for '"', a space or CR in an MBCS string. I browsed several codepages (for example the Chinese codepage 936: https://ssl.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL) and, as far as I can see, all trailing bytes start at 0x40, so it would be safe to scan the file for these punctuation characters. But is that guaranteed for every codepage?
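The extraction step described in the question could be sketched like this in Python. It is only a sketch under stated assumptions: `parse_line` is a hypothetical helper, `bytes.decode` stands in for MultiByteToWideChar, the CODEPAGE field is taken to be a name Python's codec registry recognises, and the delimiter bytes are assumed never to occur inside a multi-byte character (which is exactly what the question asks about).

```python
from typing import Tuple


def parse_line(line: bytes) -> Tuple[str, str]:
    """Split one `CODEPAGE "STRING"` or `CODEPAGE STRING` line and decode it.

    Assumes the bytes for space, '"', CR and LF never occur inside a
    multi-byte character of the named code page.
    """
    codepage, _, rest = line.partition(b" ")
    if rest.startswith(b'"'):
        # Quoted form: take everything up to the next quote byte.
        end = rest.find(b'"', 1)
        if end == -1:
            raise ValueError("unterminated quoted string")
        raw = rest[1:end]
    else:
        # Bare form: take everything up to the first space, CR or LF.
        end = len(rest)
        for delimiter in (b" ", b"\r", b"\n"):
            pos = rest.find(delimiter)
            if pos != -1:
                end = min(end, pos)
        raw = rest[:end]
    # bytes.decode stands in for MultiByteToWideChar in this sketch.
    encoding = codepage.decode("ascii")
    return encoding, raw.decode(encoding)


sample = b'cp1251 "' + "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251") + b'"\r\n'
print(parse_line(sample))
```

The search is done on raw bytes and decoding happens only after the string has been cut out, so the sketch is correct only for code pages in which the delimiters cannot be trailing bytes.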

daxim · Accepted Answer · 2019-07-31T11:49:42.107


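The script below analyses which octets can occur in encoded octet sequences once the leading octet is discarded. For the code pages it lists, the result is 0x40..0x7E and 0x80..0xFE. Space (0x20), the double quote (0x22), CR (0x0D) and LF (0x0A) all lie below 0x40, so they cannot occur as trailing bytes in these code pages.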

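#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Code pages to survey.
my @encodings = qw(
    cp1006 cp1026 cp1047 cp1250 cp1251 cp1252 cp1253 cp1254 cp1255 cp1256
    cp1257 cp1258 cp37 cp424 cp437 cp500 cp737 cp775 cp850 cp852 cp855 cp856
    cp857 cp858 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp875
    cp932 cp936 cp949 cp950
);

# Octet => number of times it was seen in a non-leading position.
my %continuation_octets;
for my $e (@encodings) {
    for my $c (0..0x10_ffff) {
        # The fallback sub yields -1 for characters the code page cannot encode.
        my $encoded = encode $e, chr($c), sub { -1 };
        if ($encoded ne -1 && length($encoded) > 1) {
            my @octets = split //, $encoded;
            shift @octets;    # discard the leading octet
            $continuation_octets{$_}++ for @octets;
        }
    }
}

# Print every octet seen in a trailing position, in ascending order.
printf "0x%02X\n", ord($_) for sort keys %continuation_octets;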
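A similar survey can be sketched in Python for the double-byte code pages only. This is not the answer's method: instead of encoding every code point, it enumerates every two-byte sequence and keeps those that decode to a single character. The names `DELIMITERS`, `DBCS_CODE_PAGES` and `trail_bytes` are hypothetical, and Python's codecs of the same names are assumed to approximate the Windows code pages.

```python
DELIMITERS = b' "\r\n'  # bytes the parser wants to search for

# Assumption: these Python codecs approximate the Windows code pages.
DBCS_CODE_PAGES = ["cp932", "cp936", "cp949", "cp950"]


def trail_bytes(encoding: str) -> set:
    """Return every byte seen in second position of a two-byte character."""
    found = set()
    for lead in range(0x80, 0x100):
        for trail in range(0x100):
            try:
                text = bytes([lead, trail]).decode(encoding)
            except UnicodeDecodeError:
                continue
            # Two bytes that decode to one character form a double-byte
            # character; two characters mean both bytes stood alone.
            if len(text) == 1:
                found.add(trail)
    return found


for name in DBCS_CODE_PAGES:
    trails = trail_bytes(name)
    clashes = sorted(b for b in DELIMITERS if b in trails)
    print(name, hex(min(trails)), hex(max(trails)), "delimiter clashes:", clashes)
```

An empty clash list for a code page means none of the four delimiter bytes was seen as a trailing byte there. The backslash (0x5C) is a different matter: it does occur as a trailing byte in cp932, for example.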

edited Jul 31 '19 at 11:49

answered Jul 31 '19 at 08:30


Thank you! But unfortunately this doesn't answer my question. What I want is to scan an octet sequence for a space, quote or CR and be sure that I will not encounter these characters as part of an encoded symbol. I know for certain that I can't do this for the 'A' character, because code 65 can be part of some other character, but I'm not sure whether the same is true for punctuation characters. Also, I can ignore all codepages which don't map 7-bit ASCII characters directly to themselves. I'm limited to codepages that can be set as the default codepage in Windows for non-Unicode programs. And thank you for the script! – Michael Ilyin Jul 31 '19 at 10:35
Super! Thank you! I definitely should recall Perl :-) It's good news that it's OK to scan a sequence in any of the listed codepages for CR, space or quote. It's a pity, though, that "\" is not safe: %continuation_octets = ( "\@" => 307, "[" => 308, "\\" => 305, "]" => 309, "^" => 302, "`" => 307, "{" => 302, "|" => 308, "}" => 306, "~" => 304, "\200" => 177, .... – Michael Ilyin Jul 31 '19 at 13:26
