Can python process multiple Chinese encodings in one string?

Question

The input string contains Chinese words, some encoded as UTF-8 and some as GB2312. Can Python process a string with multiple encodings like this?

Give us an example. A string is either a *sequence of characters* (in this case the encoding doesn’t matter), or a *sequence of bytes* (which should only have a *single* encoding). So we need examples to understand you. At least three. — Roland Illig, Nov 08 '13 at 08:24
The OP's question seems to imply a single sequence of bytes that contains a mixture of bytes created by UTF-8 encoding some Chinese characters and GB2312 encoding some Chinese characters. If that's what's going on...well, it might be possible for some input but it's gonna be ugly and unreliable. — Peter DeGlopper, Nov 08 '13 at 08:34
@PeterDeGlopper: according to http://en.wikipedia.org/wiki/GB_2312, GB2312 is not a character encoding (in the same sense that Unicode is not a character encoding), i.e., GB2312 code points can be represented as bytes in different ways using the EUC-CN or HZ encodings (just as Unicode code points can be represented as bytes using UTF-8, UTF-16, or UTF-32). Given the self-synchronizing property of UTF-8, it *might* be possible to extract the corresponding byte sequences, depending on how GB2312 is represented. — jfs, Nov 08 '13 at 08:51
@J.F.Sebastian I think Wikipedia is wrong. In practice, the two terms EUC-CN and GB2312 are used interchangeably. There are official standards. HZ is used only by a very small group. — Leonardo.Z, Nov 08 '13 at 09:08
@Leonardo.Z: Python also treats `'gb2312'` as an encoding. Judging by its aliases, [chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58](http://docs.python.org/2/library/codecs.html), it is the same encoding as `'euccn'`. — jfs, Nov 08 '13 at 09:39
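
To illustrate the approach hinted at in the comments, here is a minimal Python 3 sketch of a greedy decoder for a single byte string that mixes UTF-8 and GB2312 (EUC-CN) sequences. The function names `decode_mixed` and `_utf8_len` are hypothetical, not part of any library. At each position it tries a well-formed UTF-8 sequence first and falls back to a two-byte GB2312 sequence. This is only a heuristic: some runs of GB2312 bytes also happen to form valid UTF-8, so the result can be wrong for certain inputs, which is the unreliability mentioned above.

```python
def _utf8_len(lead: int) -> int:
    """Expected length of a multi-byte UTF-8 sequence, or 0 if not a lead byte."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_mixed(data: bytes) -> str:
    """Greedy heuristic: prefer UTF-8, fall back to GB2312, else U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII is identical in both encodings.
            out.append(chr(b))
            i += 1
            continue

        # Try a complete UTF-8 sequence starting at this byte.
        n = _utf8_len(b)
        chunk = data[i:i + n]
        if n and len(chunk) == n:
            try:
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass

        # Fall back to a two-byte GB2312 (EUC-CN) sequence.
        try:
            out.append(data[i:i + 2].decode("gb2312"))
            i += 2
            continue
        except UnicodeDecodeError:
            pass

        # Neither encoding fits: emit the replacement character and resync.
        out.append("\ufffd")
        i += 1
    return "".join(out)


mixed = "中文".encode("utf-8") + "汉字".encode("gb2312")
print(decode_mixed(mixed))
```

Trying UTF-8 first is a design choice based on its stricter byte structure; if the input is known to be mostly GB2312, reversing the order may give better results. Whenever the boundaries between the differently encoded parts are known, decoding each part separately with its own codec is far more reliable.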
