
My input string contains Chinese words, some encoded as UTF-8 and some as GB2312. Can Python process a string that mixes multiple encodings like this?

shad0w_wa1k3r
David Wang
  • Give us an example. A string is either a *sequence of characters* (in this case the encoding doesn’t matter), or a *sequence of bytes* (which should only have a *single* encoding). So we need examples to understand you. At least three. – Roland Illig Nov 08 '13 at 08:24
  • The OP's question seems to imply a single sequence of bytes that contains a mixture of bytes created by UTF-8 encoding some Chinese characters and GB2312 encoding some Chinese characters. If that's what's going on...well, it might be possible for some input but it's gonna be ugly and unreliable. – Peter DeGlopper Nov 08 '13 at 08:34
  • @PeterDeGlopper: according to http://en.wikipedia.org/wiki/GB_2312, GB2312 is not a character encoding (in the same sense that Unicode is not a character encoding), i.e., GB2312 code points can be represented as bytes in different ways using the EUC-CN or HZ encodings (just as Unicode code points can be represented as bytes in different ways using the UTF-8, UTF-16, and UTF-32 encodings). Given the self-synchronizing property of UTF-8, it *might* be possible to extract the corresponding byte sequences, depending on how GB2312 is represented. – jfs Nov 08 '13 at 08:51
  • @J.F.Sebastian I think that Wikipedia is wrong. In practice, the two terms EUC-CN and GB2312 are used interchangeably. There are official standards. HZ is used only by a very small group. – Leonardo.Z Nov 08 '13 at 09:08
  • @Leonardo.Z: Python also treats `'gb2312'` as an encoding. Judging by its aliases, [chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58](http://docs.python.org/2/library/codecs.html), it is the same encoding as `'euccn'`. – jfs Nov 08 '13 at 09:39
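
The idea raised in the comments (use UTF-8's strict byte structure to find the spans that are not UTF-8, then treat those as GB2312) can be sketched roughly as below. This is only a heuristic sketch in Python 3, not a reliable solution: the names `decode_mixed` and `gb2312_fallback` are made up for illustration, and it assumes every byte pair that fails UTF-8 decoding is GB2312 (EUC-CN). Some GB2312 byte pairs also happen to be valid UTF-8 and would be silently misread, which is the unreliability the comments warn about.

```python
import codecs


def _gb2312_fallback(err):
    """Error handler: when UTF-8 decoding fails, try the next two bytes as GB2312."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    pair = err.object[err.start:err.start + 2]
    try:
        # A GB2312 (EUC-CN) character is two bytes; resume decoding after the pair.
        return pair.decode("gb2312"), err.start + 2
    except UnicodeDecodeError:
        # Neither valid UTF-8 nor GB2312: emit U+FFFD and skip one byte.
        return "\ufffd", err.start + 1


# "gb2312_fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("gb2312_fallback", _gb2312_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode bytes as UTF-8, falling back to GB2312 where UTF-8 is invalid."""
    return data.decode("utf-8", errors="gb2312_fallback")


# The same two characters, once in each encoding, with ASCII in between.
sample = "中文".encode("utf-8") + b" / " + "中文".encode("gb2312")
print(decode_mixed(sample))

# 'gb2312' and 'euccn' resolve to the same codec, as noted in the last comment.
print(codecs.lookup("gb2312").name == codecs.lookup("euccn").name)
```

Registering an error handler keeps the UTF-8 decoder in charge of resynchronizing, so only the invalid spans are handed to the GB2312 codec. Pure-ASCII bytes decode identically under both encodings and are never affected.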

0 Answers