
Due to a bug in a C extension, I'm getting Unicode data in str instances, or in other words, a str with no encoding at all that looks like a unicode literal.

So, for instance, this is a valid unicode literal:

>>> u'\xa1Se educado!'

And the UTF-8 encoded str would be:

>>> '\xc2\xa1Se educado!'

However, I get a str containing the unicode literal:

>>> '\xa1Se educado!'

And I need to create a unicode instance from that. Using unicode() doesn't work, since it expects an encoding. I figured out that ''.join(unichr(ord(x)) for x in s) does what I need, but it's really ugly. There has to be a better solution. Any ideas?
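For reference, a quick sketch of that ugly workaround in a Python 2 shell (the name s is just for illustration), showing the result I'm after:

>>> s = '\xa1Se educado!'
>>> ''.join(unichr(ord(x)) for x in s)
u'\xa1Se educado!'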


2 Answers


As I suspected, there has to be a way to decode it with whatever "encoding" Python uses for unicode literals, and that's raw_unicode_escape.

>>> unicode('\xa1Se educado!', 'raw_unicode_escape')
u'\xa1Se educado!'
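If it helps, the str.decode form should be equivalent (a sketch, assuming the same Python 2 setup):

>>> '\xa1Se educado!'.decode('raw_unicode_escape')
u'\xa1Se educado!'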

I get a str with the unicode literal: '\xa1Se educado!'

Not really: \xa1 is not a Unicode-specific escape. \xa1 in a byte string means byte number 161, and \xa1 in a Unicode string means character (code point) number 161, the same as \u00A1.

What you have is a byte string containing an ISO-8859-1 encoding of ¡Se educado! instead of the UTF-8 encoding. In the ISO-8859-1 encoding each byte number happens to match the Unicode character of the same code point number. To decode an ISO-8859-1 byte string to a Unicode string use:

>>> '\xa1Se educado!'.decode('iso-8859-1')
u'\xa1Se educado!'

Although, actually, if you are using Windows then the encoding is likely to be code page 1252 ('windows-1252') rather than ISO-8859-1. They're similar encodings but not quite the same. Code page 1252 is the default ‘ANSI’ code page that Windows uses for non-Unicode applications in the Western European and US locales. If you are getting this data from a Windows non-Unicode application running on the same machine, you should decode it using the encoding 'mbcs', which corresponds to whatever the locale-specific default code page is.
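As a rough illustration of the difference (a sketch, assuming a Python 2 shell): byte 0x80 is the euro sign in code page 1252, but only an unused control character in ISO-8859-1:

>>> '\x80'.decode('windows-1252')
u'\u20ac'
>>> '\x80'.decode('iso-8859-1')
u'\x80'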

These are legacy encodings that cannot hold all Unicode characters. You will probably find the C extension cannot cope at all with characters outside the currently set code page.
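For example, in a Python 2 shell, trying to encode a character that code page 1252 cannot represent fails with an error along these lines (the exact message may vary):

>>> u'\u0107'.encode('windows-1252')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0107' in position 0: character maps to <undefined>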

bobince
  • 528,062
  • 107
  • 651
  • 834
  • Nope. The example was poor in that it matches ISO-8859-1, but as soon as I have characters outside ISO-8859-1, it breaks and I get escaped \u sequences. For instance, u'€95.00' is coming up as '\u20ac95.00'. I'm pretty sure someone is writing raw Python unicode literals somehow. Thanks for the help anyway. – Pedro Werneck May 15 '14 at 16:10
  • There's no `\u` escape in a byte string; do you mean to say you have `'\\u20ac95.00'`? And yet you have `'\xa1'` (i.e. a literal byte 161, not `'\\xa1'`) for characters U+0000 to U+00FF? – bobince May 15 '14 at 16:54
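Following up on that exchange: if the data really does arrive as a literal backslash-u sequence (i.e. the two characters \ and u followed by hex digits), the raw_unicode_escape codec from the accepted answer interprets those escapes too, which would explain why it worked for this data (a sketch, Python 2 assumed):

>>> '\\u20ac95.00'.decode('raw_unicode_escape')
u'\u20ac95.00'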