4

I was doing some work today and came across something that "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.

So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:

D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'

It seems IronPython isn't a fan either:

D:\> ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'

If somebody could tell me what, exactly, is going on here, it'd be much appreciated.

NoName
  • Nicely asked question... the link to an image of the expected character is a nice touch. – Jarret Hardie Feb 15 '10 at 21:53
  • Encoding something in UTF-16 and then decoding using UTF-8 is unlikely to produce sensible results. At best -- if the input is ASCII encodable -- you get a sensible character every second one :) – Thomas Wouters Feb 15 '10 at 22:09
  • Yep, that last line was a mistype that confused me greatly. Thanks. – NoName Feb 15 '10 at 22:19

4 Answers

2

This seems to be the correct behaviour. The character u'\u3030', when encoded in UTF-16, produces the same bytes as '00' encoded in UTF-8 (or ASCII): both are '\x30\x30'. It looks strange, but it's correct.

The '\xff\xfe' you can see is just a Byte Order Mark.
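To make the coincidence concrete, here is a minimal sketch in Python 3 syntax (an assumption on my part, since the session in the question is Python 2, where `str` plays the role of `bytes`). The variable names are only illustrative.

```python
# Python 3 sketch; the session in the question is Python 2.
wavy_dash = "\u3030"   # U+3030 WAVY DASH
two_zeros = "00"       # two DIGIT ZERO characters (U+0030 twice)

utf16_bytes = wavy_dash.encode("utf-16-le")  # no BOM, little-endian
utf8_bytes = two_zeros.encode("utf-8")       # ASCII range, one byte each

print(utf16_bytes.hex())            # 3030
print(utf8_bytes.hex())             # 3030
print(utf16_bytes == utf8_bytes)    # True
```

The bytes are identical; only the codec used to read them decides whether they mean one wavy dash or two zeros.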

Are you sure you want a wavy dash, and not some other character? If you were hoping for a different character then it might be because it had already been misencoded before entering your application.

Mark Byers
  • Well, it's coming from a barely documented AD LDAP attribute called userParameters. The reason I noticed it is that the field has both 0x00 and the '\xe3\x80\xb0' combo (right near each other, actually...). I suppose it's possible that Microsoft isn't encoding things correctly. – NoName Feb 15 '10 at 22:01
  • Perhaps it's clearer if you write it as `'\x30\x30'` instead of `'00'`? Different notation, same string. – Thomas Wouters Feb 15 '10 at 22:03
  • @NoName: It's possible that they are using \x00 as a delimiter - I'm not familiar with the protocol so it's just a guess. Assuming it's not sensitive information, you might want to post the entire string here as it might give us some more hints. – Mark Byers Feb 15 '10 at 22:09
  • Thanks for the help. Yeah, it was definitely a misunderstanding. The data is a UTF-8 string that needs to be encoded as UTF-16-LE so it can be read as packed binary. One of the values contains ASCII "30000000...0x00", which is itself a hex string meant to be interpreted as memory (a struct); when hex-decoded it becomes the ASCII string '0', which should then be parsed as an integer. You can see why I was confused ;) – NoName Feb 15 '10 at 22:17
2

But it decodes okay:

>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'

It's just that the UTF-16 encoding of that character (the two bytes 0x30 0x30) happens to coincide with the ASCII code for '0' (0x30), repeated twice. You could also represent it with '\x30\x30':

>>> '00' == '\x30\x30'
True
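This also suggests why utf-16-le and utf-16-be gave the same output in the question: the 16-bit code unit 0x3030 has two identical bytes, so byte order makes no visible difference. A rough sketch in Python 3 syntax (an assumption; the thread uses Python 2), with U+3042 picked by me purely as a contrasting example:

```python
import struct

# For characters in the Basic Multilingual Plane, UTF-16 stores the code
# point as a single 16-bit unit, so struct can mimic the encoder here.
wavy = ord("\u3030")                 # 0x3030
wavy_le = struct.pack("<H", wavy)    # little-endian 16-bit unit
wavy_be = struct.pack(">H", wavy)    # big-endian 16-bit unit
print(wavy_le == wavy_be == b"00")   # True: both bytes are 0x30

# Contrast: U+3042 (chosen only for illustration) has two different bytes,
# so the byte order shows up in the output.
other = ord("\u3042")                # 0x3042
print(struct.pack("<H", other))      # b'B0'
print(struct.pack(">H", other))      # b'0B'
```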
huin
1

You are being confused by two things here (threw me off too):

  1. The utf-16 and utf-32 encodings prepend a BOM unless you specify which byte order to use, via utf-16-be and such. That is the \xff\xfe in the second-to-last result.
  2. '00' is two digit-zero characters (0x30 each). It is not a pair of null characters. Those would print differently anyway:

    >>> '\0\0'
    '\x00\x00'
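Both points can be checked with a short sketch. It uses Python 3 syntax (my assumption; the thread is Python 2) and the BOM constants from the standard `codecs` module. The plain "utf-16" codec writes in the machine's native byte order, so the check accepts either BOM.

```python
import codecs

text = "\u3030"

# Point 1: "utf-16" prepends a BOM; "utf-16-le" does not.
with_bom = text.encode("utf-16")
without_bom = text.encode("utf-16-le")
bom, payload = with_bom[:2], with_bom[2:]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(payload == without_bom)                             # True

# Point 2: digit zeros and null bytes are different byte values.
digit_zeros = b"00"
null_bytes = b"\x00\x00"
print(list(digit_zeros))   # [48, 48]
print(list(null_bytes))    # [0, 0]
```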
    
Rhamphoryncus
0

There is a basic error in your sample code above. Remember: you encode Unicode text into a byte string, and you decode a byte string back into Unicode. In your last line, you do:

'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')

which translates to the following steps:

'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!

u'\u3030' is indeed encoded as '00' (ASCII zero twice) in UTF-16LE, but you seem to have read that as a null byte ('\0') or something similar.
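The same chain, split into named steps, can be sketched as follows. This is Python 3 syntax (an assumption; the original session is Python 2), and the variable names are mine. The last two lines contrast the mismatched decode with the matching one.

```python
raw = b"\xe3\x80\xb0"                     # bytes as they arrive (UTF-8)
text = raw.decode("utf-8")                # Unicode text: U+3030
utf16 = text.encode("utf-16-le")          # b'00', i.e. 0x30 0x30

wrong = utf16.decode("utf-8")             # mismatched codec
right = utf16.decode("utf-16-le")         # matching codec

print(repr(wrong))                        # '00'
print(right == text)                      # True
```

Decoding with the codec that was used for encoding returns the original character; decoding with UTF-8 silently yields two zeros, because 0x30 0x30 happens to be valid UTF-8 as well.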

Remember, you generally won't get the same character back if you encode with one encoding and decode with another:

>>> import unicodedata as ud
>>> c = unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'

In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
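For readers on Python 3, the same experiment might look like the sketch below. The helper `reread` is a hypothetical name introduced here for illustration, not part of any library; `chr` stands in for Python 2's `unichr`.

```python
import unicodedata

def reread(text, written_as, read_as):
    """Encode text with one codec, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)

a_acute = chr(193)  # LATIN CAPITAL LETTER A WITH ACUTE

mismatched = reread(a_acute, "cp1252", "cp1253")
matched = reread(a_acute, "cp1252", "cp1252")

print(unicodedata.name(mismatched))   # GREEK CAPITAL LETTER ALPHA
print(matched == a_acute)             # True

# The situation from the question: written as UTF-16-LE, read as UTF-8.
print(repr(reread("\u3030", "utf-16-le", "utf-8")))   # '00'
```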

tzot