
Due to a bug in a C extension, I'm getting Unicode data in str instances, or in other words, a str with no encoding at all that looks like a unicode literal.

So, for instance, this is a valid unicode literal:

>>> u'\xa1Se educado!'

And the UTF-8 encoded str would be:

>>> '\xc2\xa1Se educado!'

However, I get a str containing the unicode literal:

>>> '\xa1Se educado!'

And I need to create a unicode instance from that. Using unicode() doesn't work, since it expects an encoding. I figured out that ''.join(unichr(ord(x)) for x in s) does what I need, but it's really ugly. There has to be a better solution. Any ideas?
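For reference, a quick sketch of that ugly workaround in a Python 2 shell (the name s is just for illustration), showing the result I'm after:

>>> s = '\xa1Se educado!'
>>> ''.join(unichr(ord(x)) for x in s)
u'\xa1Se educado!'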


2 Answers


As I suspected, there has to be a way to decode it with whatever "encoding" Python uses for unicode literals, and that's raw_unicode_escape.

>>> unicode('\xa1Se educado!', 'raw_unicode_escape')
u'\xa1Se educado!'
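If it helps, the str.decode form should be equivalent (a sketch, assuming the same Python 2 setup):

>>> '\xa1Se educado!'.decode('raw_unicode_escape')
u'\xa1Se educado!'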

I get a str with the unicode literal: '\xa1Se educado!'

Not really: \xa1 is not a Unicode-specific escape. \xa1 in a byte string means byte number 161, and \xa1 in a Unicode string means character (code point) number 161, the same as \u00A1.

What you have is a byte string containing an ISO-8859-1 encoding of ¡Se educado! instead of the UTF-8 encoding. In the ISO-8859-1 encoding each byte number happens to match the Unicode character of the same code point number. To decode an ISO-8859-1 byte string to a Unicode string use:

>>> '\xa1Se educado!'.decode('iso-8859-1')
u'\xa1Se educado!'

Although, actually, if you are using Windows then the encoding is likely to be code page 1252 ('windows-1252') rather than ISO-8859-1. They're similar encodings but not quite the same. Code page 1252 is the default ‘ANSI’ code page that Windows uses for non-Unicode applications in the Western European and US locales. If you are getting this data from a Windows non-Unicode application running on the same machine, you should decode it using the encoding 'mbcs', which corresponds to whatever the locale-specific default code page is.
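As a rough illustration of the difference (a sketch, assuming a Python 2 shell): byte 0x80 is the euro sign in code page 1252, but only an unused control character in ISO-8859-1:

>>> '\x80'.decode('windows-1252')
u'\u20ac'
>>> '\x80'.decode('iso-8859-1')
u'\x80'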

These are legacy encodings that cannot hold all Unicode characters. You will probably find the C extension cannot cope at all with characters outside the currently set code page.
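For example, in a Python 2 shell, trying to encode a character that code page 1252 cannot represent fails with an error along these lines (the exact message may vary):

>>> u'\u0107'.encode('windows-1252')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0107' in position 0: character maps to <undefined>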

bobince
  • 528,062
  • 107
  • 651
  • 834
  • Nope. The example was poor in that it matches ISO-8859-1, but as soon as I have characters outside ISO-8859-1, it breaks and I get escaped \u sequences. For instance, u'€95.00' is coming up as '\u20ac95.00'. I'm pretty sure someone is writing raw Python unicode literals somehow. Thanks for the help anyway. – Pedro Werneck May 15 '14 at 16:10
  • There's no `\u` escape in a byte string; do you mean to say you have `'\\u20ac95.00'`? And yet you have `'\xa1'` (i.e. a literal byte 161, not `'\\xa1'`) for characters U+0000 to U+00FF? – bobince May 15 '14 at 16:54
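Following up on that exchange: if the data really does arrive as a literal backslash-u sequence (i.e. the two characters \ and u followed by hex digits), the raw_unicode_escape codec from the accepted answer interprets those escapes too, which would explain why it worked for this data (a sketch, Python 2 assumed):

>>> '\\u20ac95.00'.decode('raw_unicode_escape')
u'\u20ac95.00'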