0

Some special Chinese characters, such as '觱' and '踨', give a surprising result when I check their GB18030 encoding, as follows.

>>> u'觱'.encode('gb18030')
'\xd3v'

I am confused by the result '\xd3v'. It does not look like valid hex digits.
Can someone explain it clearly?

Actually, my task is to convert GB18030 codes given as hex strings, such as 'CDF2', 'F4A5', etc., into
their corresponding Unicode characters.

>>> 'CDF2'.decode('hex').decode('gb18030')
u'\u4e07'

But,

>>> 'd3v'.decode('hex').decode('gb18030')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/hex_codec.py", line 42, in hex_decode
    output = binascii.a2b_hex(input)
TypeError: Odd-length string

So I don't understand why the encode method returns something that is not all hex digits.
For example, what does the 'v' in '\xd3v' mean?

Qinghua

2 Answers

1

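'\xd3v' == '\xd3\x76'. When Python shows the repr of a byte string, it prints printable ASCII bytes as the characters themselves (and a few control characters as short escapes such as \n and \t) instead of in hexadecimal form. The byte 0x76 is the ASCII code for 'v'.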

>>> '\xd3v' == '\xd3\x76'
True

If you want the hexadecimal form, use encode('hex') (as you did with decode('hex')):

>>> u'觱'.encode('gb18030').encode('hex')
'd376'

or use binascii.hexlify:

>>> import binascii
>>> binascii.hexlify(u'觱'.encode('gb18030'))
'd376'
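
For the conversion task in the question, here is a minimal sketch built on the same binascii calls. The helper names gb18030_hex_to_text and text_to_gb18030_hex are made up for illustration, and the sketch assumes each input is a complete, even-length hex string such as 'CDF2' or 'D376' (not the repr form 'd3v'). It avoids the Python 2-only 'hex' codec, so it should also run on Python 3.

```python
import binascii


def gb18030_hex_to_text(hex_code):
    """Decode a hex string of GB18030 bytes (hypothetical helper)."""
    raw = binascii.unhexlify(hex_code)   # 'CDF2' -> the two bytes 0xCD 0xF2
    return raw.decode('gb18030')         # bytes -> Unicode text


def text_to_gb18030_hex(text):
    """Encode Unicode text as GB18030 and return lowercase hex digits."""
    raw = text.encode('gb18030')
    return binascii.hexlify(raw).decode('ascii')


# Round trip: the hex form always has an even number of digits.
print(text_to_gb18030_hex(gb18030_hex_to_text('D376')))  # d376
```

The point is to always pass the full hex digits ('d376') between the two steps, never the repr of the byte string.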
falsetru
0

It is just a "v": the character encoded in the "gb18030" encoding is represented by two bytes, the first being "\xd3" (decimal 211) and the second being decimal 118 (hex 0x76), which is the ASCII code for "v". The default behavior of Python 2.x when showing the repr of a byte string is to display bytes in the printable ASCII range (32-126) as their ASCII characters, and other bytes as two-digit hexadecimal escapes (apart from a few short escapes such as \n and \t).

Thus:
>>> a = u'觱'.encode('gb18030')
>>> ord(a[0])
211
>>> ord(a[1])
118
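
To see every byte in the same \xNN form regardless of whether it is printable, something like the following sketch could be used. The function name escape_all_bytes is hypothetical; it is not a built-in, just an illustration of formatting each byte value by hand.

```python
def escape_all_bytes(data):
    """Render each byte of a byte string as a \\xNN escape (hypothetical helper)."""
    # bytearray() yields integer byte values on both Python 2 and Python 3.
    return ''.join('\\x%02x' % b for b in bytearray(data))


# The two-byte string from above: 0xd3 followed by 0x76 ('v').
print(escape_all_bytes(b'\xd3v'))  # \xd3\x76
```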

Now, if you are working in a terminal that uses the gb18030 encoding, printing the string itself (its str form rather than its repr) would show you the original Chinese character.

>>> print a
jsbueno