
Using Python 3.4, suppose I have some data read from a file that is literally the six individual characters \ u 0 0 C 0, and I need to convert it to the single Unicode character \u00C0. Is there a simple way to do that conversion? I can't find anything in the Python 3.4 Unicode documentation that provides it, apart from a convoluted approach that runs exec() on an assignment statement, which I'd like to avoid if possible.

Thanks.

  • possible duplicate of [How do convert unicode escape sequences to unicode characters in a python string](http://stackoverflow.com/questions/990169/how-do-convert-unicode-escape-sequences-to-unicode-characters-in-a-python-string) – tripleee Jan 25 '15 at 15:12

1 Answer


Well, there is:

>>> b'\\u00C0'.decode('unicode-escape')
'À'

However, the unicode-escape codec is aimed at one particular string format: the Python string literal. It may produce unexpected results when faced with other escape sequences that are special in Python, such as \xC0, \n, \\ or \U000000C0, and it may not recognise escape sequences from other string literal formats. It may also handle characters outside the Basic Multilingual Plane incorrectly (e.g. JSON would encode U+10000 as the surrogate pair \uD800\uDC00).
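As a rough illustration of those edge cases, here is a small sketch, assuming CPython 3.3 or later. The sample byte strings and the names `samples` and `decoded` are made up for the example.

```python
# Each value holds literal backslash escapes, as they might appear in a file.
samples = {
    "u_escape": b"\\u00C0",
    "x_escape": b"\\xC0",
    "newline": b"\\n",
    "surrogates": b"\\uD800\\uDC00",  # how JSON would write U+10000
}

decoded = {name: raw.decode("unicode-escape") for name, raw in samples.items()}

# \u00C0 and \xC0 both become 'À', and \n becomes a real newline,
# because the codec applies Python string-literal rules to everything.
print(ascii(decoded["u_escape"]), ascii(decoded["x_escape"]), ascii(decoded["newline"]))

# The surrogate pair is NOT combined: the result is two lone surrogates,
# not the single character U+10000.
print(len(decoded["surrogates"]), ascii(decoded["surrogates"]))
```

If the data can contain \x, \n or \\ sequences that should stay as they are, or surrogate pairs, this codec will not do what you want.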

So unless your input data really is a Python string literal shorn of its quote delimiters, this isn't the right thing to do and it'll likely produce unwanted results for some of these edge cases. There are lots of formats that use \u to signal Unicode characters; you should try to find out what format it is exactly, and use a decoder for that scheme. For example if the file is JSON, the right thing to do would be to use a JSON parser instead of trying to deal with \u/\n/\\/etc yourself.
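For instance, if the data did turn out to be JSON, a sketch using the standard `json` module might look like this. The `raw` value is a hypothetical stand-in for the file contents, and it assumes the string still has its surrounding double quotes.

```python
import json

# Hypothetical file contents: a JSON string, quotes included, holding
# literal backslash-u escapes (one of them a surrogate pair).
raw = '"\\u00C0 and \\uD800\\uDC00"'

text = json.loads(raw)

# The JSON parser applies JSON's rules: \u00C0 becomes 'À' and the
# surrogate pair is combined into the single character U+10000.
print(ascii(text), len(text))
```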

bobince
  • I just tried Python 3.3 and 3.4 and `unicode-escape` *does* do `\xc0`, `\n` and `\\`. – Mark Tolonen Jan 25 '15 at 10:20
  • @Mark: you're right, the behaviour of the `unicode-escape` codec has changed since its introduction in Py2. I'll reword the answer. – bobince Jan 25 '15 at 14:44
  • Thanks. That solves it for me. Given that my data is in variable x, I can use `bytes(x, "utf-8").decode('unicode-escape')` and get exactly what I need. – Walt Farrell Jan 25 '15 at 18:26
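One caveat on the `bytes(x, "utf-8")` approach in the last comment, sketched below: it works when `x` is pure ASCII, but the codec reads non-escaped bytes as Latin-1, so any real non-ASCII character already in `x` gets mangled. The input string is hypothetical, and the Latin-1 plus `backslashreplace` variant is one possible workaround rather than anything proposed in the thread.

```python
# Hypothetical input: a real 'é' followed by a literal backslash-u escape.
x = "caf\u00e9 \\u00C0"

# Approach from the comment: the two UTF-8 bytes of 'é' are each decoded
# as a separate Latin-1 character.
via_utf8 = bytes(x, "utf-8").decode("unicode-escape")

# Possible workaround: encode as Latin-1, turning anything outside Latin-1
# into a backslash escape that the codec then decodes again.
via_latin1 = x.encode("latin-1", "backslashreplace").decode("unicode-escape")

print(ascii(via_utf8))
print(ascii(via_latin1))
```

Both variants still interpret every Python escape sequence in the data, so the earlier warnings about \n, \\ and surrogates apply unchanged.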