I was playing around with Python's unicode and encoding methods. I used the special character "‽" and a Chinese character to see how the different UTF encodings deal with these characters, and I got this output:

>>> a = u"‽"
>>> encoded_a = a.encode('utf-32')
>>> a
u'\u203d'
>>> encoded_a
'\xff\xfe\x00\x00= \x00\x00'
>>> b = u"安"
>>> encoded_b = b.encode('utf-32')
>>> b
u'\u5b89'
>>> encoded_b
'\xff\xfe\x00\x00\x89[\x00\x00'

My question is: what do the equals sign and the square bracket mean in the encoded result?

David Zheng

3 Answers

"\xff\xfe\x00\x00" is U+FEFF, the zero-width no-break space character, encoded as little-endian UTF-32; it is better known for its use as a byte order mark (BOM). Beats me why Python inserts this into the string, but I'm sure there's a way to request only the encoding of the given string, without a prefix that lets other programs recognize it as UTF-32.

This is followed by the bytes 3d, 20 and two more nulls, which represent the codepoint 203d in little endian byte order. 3d, when interpreted as ASCII, becomes the equals sign and 20 becomes the space character.
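
The byte arithmetic above can be checked with a short sketch. It assumes Python 3 (the question's transcript is Python 2, where the result would be a `str` rather than `bytes`) and uses the explicit `utf-32-le` codec so that no BOM is prepended; the variable names are only illustrative.

```python
import struct

# Encode U+203D (‽) with an explicit byte order, so no BOM is added.
encoded = "\u203d".encode("utf-32-le")

# Read the four bytes back as one little-endian unsigned 32-bit integer.
(code_point,) = struct.unpack("<I", encoded)

# Show which bytes fall in the printable ASCII range 0x20..0x7e.
ascii_view = [chr(b) if 0x20 <= b <= 0x7E else None for b in encoded]

print(encoded)          # the repr shows printable bytes as characters
print(hex(code_point))  # the code point recovered from the bytes
print(ascii_view)
```

The bytes 3d and 20 are the only ones in the printable range, which is why they show up as "=" and a space in the repr.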

Ulrich Eckhardt
    Python inserts the BOM if you use an encoding that's greater than 8 bits without specifying endianness. To lose the BOM use `'utf-32le'` or `'utf-32be'`. – Mark Ransom May 19 '16 at 03:10
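
A minimal sketch of the point in the comment above, assuming Python 3. The plain `utf-32` codec prepends a BOM and uses the machine's native byte order, while the endian-specific codecs write no BOM; the variable names are illustrative only.

```python
import codecs

text = "\u5b89"  # 安

le = text.encode("utf-32-le")    # little-endian, no BOM
be = text.encode("utf-32-be")    # big-endian, no BOM
generic = text.encode("utf-32")  # BOM first, then native byte order

# Adding the little-endian BOM by hand reproduces the question's output.
with_bom = codecs.BOM_UTF32_LE + le

print(le)
print(be)
print(with_bom)
print(generic[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))
```

Decoding with the plain `utf-32` codec consumes the BOM again, so `generic.decode("utf-32")` gives back the original character.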

When you print the repr of a byte string, any byte value in the range of \x20 through \x7e is displayed as the equivalent printable ASCII character (the backslash is shown doubled, and a quote character may be escaped). In this case, = is the same as \x3d and [ is the same as \x5b. You missed the space, which is \x20.

Here's the complete table:

\x20 ' '    \x21 '!'    \x22 '"'    \x23 '#'
\x24 '$'    \x25 '%'    \x26 '&'    \x27 '''
\x28 '('    \x29 ')'    \x2a '*'    \x2b '+'
\x2c ','    \x2d '-'    \x2e '.'    \x2f '/'
\x30 '0'    \x31 '1'    \x32 '2'    \x33 '3'
\x34 '4'    \x35 '5'    \x36 '6'    \x37 '7'
\x38 '8'    \x39 '9'    \x3a ':'    \x3b ';'
\x3c '<'    \x3d '='    \x3e '>'    \x3f '?'
\x40 '@'    \x41 'A'    \x42 'B'    \x43 'C'
\x44 'D'    \x45 'E'    \x46 'F'    \x47 'G'
\x48 'H'    \x49 'I'    \x4a 'J'    \x4b 'K'
\x4c 'L'    \x4d 'M'    \x4e 'N'    \x4f 'O'
\x50 'P'    \x51 'Q'    \x52 'R'    \x53 'S'
\x54 'T'    \x55 'U'    \x56 'V'    \x57 'W'
\x58 'X'    \x59 'Y'    \x5a 'Z'    \x5b '['
\x5c '\'    \x5d ']'    \x5e '^'    \x5f '_'
\x60 '`'    \x61 'a'    \x62 'b'    \x63 'c'
\x64 'd'    \x65 'e'    \x66 'f'    \x67 'g'
\x68 'h'    \x69 'i'    \x6a 'j'    \x6b 'k'
\x6c 'l'    \x6d 'm'    \x6e 'n'    \x6f 'o'
\x70 'p'    \x71 'q'    \x72 'r'    \x73 's'
\x74 't'    \x75 'u'    \x76 'v'    \x77 'w'
\x78 'x'    \x79 'y'    \x7a 'z'    \x7b '{'
\x7c '|'    \x7d '}'    \x7e '~'

Your two strings are actually '\xff\xfe\x00\x00\x3d\x20\x00\x00' and '\xff\xfe\x00\x00\x89\x5b\x00\x00'.
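
To see every byte as a hex escape regardless of whether it is printable, one option is to format the bytes yourself. This is a sketch assuming Python 3; the BOM is added by hand so the result does not depend on the machine's byte order, and `escaped` is just an illustrative name.

```python
import codecs

# Same bytes as the first string in the question: BOM + U+203D, little-endian.
data = codecs.BOM_UTF32_LE + "\u203d".encode("utf-32-le")

# The default repr mixes hex escapes with printable ASCII characters.
print(repr(data))

# Formatting each byte explicitly shows all of them as \xNN.
escaped = "".join("\\x{:02x}".format(b) for b in data)
print(escaped)

# bytes.hex() gives the same information without the \x prefixes.
print(data.hex())
```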

Mark Ransom
The first two bytes, \xff\xfe, belong to the BOM, or Byte Order Mark (in UTF-16 the BOM is just those two bytes; in UTF-32 it is four bytes long, \xff\xfe\x00\x00). Looking at the Python documentation for Unicode, it would appear that the characters you are seeing are the ASCII rendering of the remaining byte values. One of the examples provided in the documentation does the same thing you are doing and prints the bytes the same way:

>>> unistring.encode('utf-16')
'\xff\xfeH\x00i\x00\n\x00'
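
The output above can be reproduced with a short sketch, assuming Python 3. The value of `unistring` is not shown in the excerpt, so `"Hi\n"` here is an assumption inferred from the output bytes; the BOM is added explicitly to keep the result independent of the machine's byte order.

```python
import codecs

# Assumed value: the excerpt does not show how unistring was defined.
unistring = "Hi\n"

# Little-endian UTF-16 with the BOM prepended by hand.
encoded = codecs.BOM_UTF16_LE + unistring.encode("utf-16-le")

# H (0x48) and i (0x69) are printable, so the repr shows them as letters;
# 0x0a is shown as the escape \n, and the remaining bytes as \xNN.
print(encoded)
```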