
I have a string of unicode ordinals (in hex form) like so:

\u063a\u064a\u0646\u064a\u0627

It's the Unicode representation of the Arabic string غينيا (taken from an Arabic lorem ipsum generator).

I want to convert the Unicode hex string to غينيا. I tried `print u'%s' % "\u063a\u064a\u0646\u064a\u0627"` (pointed out here), but that simply prints the hex format, not the symbols. `print word.replace("\u","\\u")` doesn't do the job either. What should I do?

Hassan Baig
  • Is `\u063a\u064a\u0646\u064a\u0627` an ascii string, where the backslashes are actually escaped? – Izaak van Dongen Aug 21 '17 at 14:40
  • Where are you outputting the string to? If it is a console, then the console may not have full unicode support. – Jake Conkerton-Darby Aug 21 '17 at 14:41
  • @IzaakvanDongen: actually not escaped. Should I run a quick `s.replace("\u", "\\u")` on the hex string before trying to print it? – Hassan Baig Aug 21 '17 at 14:43
  • Still not clear what you actually have. Do you have that value in a variable? What does `len` say the length is? Better provide actual Python code we can play with. – Stefan Pochmann Aug 21 '17 at 14:44
  • @JakeConkerton-Darby: good question. Yes, indeed to the console - but my ultimate aim is to **draw** this text on top of an image using `PIL` (see here: https://stackoverflow.com/questions/45675525/using-pillow-to-draw-cursive-text) – Hassan Baig Aug 21 '17 at 14:44
  • @StefanPochmann: the hex value is actually a variable containing exactly what you see: `\u063a\u064a\u0646\u064a\u0627`. – Hassan Baig Aug 21 '17 at 14:45
  • Initialize the string like this: `s = u"\u063a\u064a\u0646\u064a\u0627"`. When dealing with strings a quick way to spot possible escaping problems is checking their length (and sometimes type): `len(s)` (which should return _5_ for this string), `type(s)` (which should return _unicode_). You could also use [\[Python\]: repr(object)](https://docs.python.org/2/library/functions.html#repr). – CristiFati Aug 21 '17 at 14:47

1 Answer


I'm not entirely sure from the question what you want, so I'll cover both cases I can see.

Case 1: You just want to output the Arabic string from your code, using the Unicode literal syntax. In this case, you should prefix your string literal with a u and you'll be right as rain:

s = u"\u063a\u064a\u0646\u064a\u0627"
print(s)

This prints the same thing as

print u'%s' % s

except that it's shorter. Formatting your string into a template that contains nothing but %s doesn't make any sense here, because it doesn't change anything: in other words, u'%s' % s == s.
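The claim that the formatting step is a no-op is easy to check. The following is only a minimal sketch; it assumes nothing beyond the string from the question, and the name formatted is introduced purely for illustration. It deliberately prints only lengths and booleans so it doesn't depend on the console's Unicode support.

```python
# Unicode literal: the \uXXXX escapes are interpreted when the literal is evaluated.
s = u"\u063a\u064a\u0646\u064a\u0627"

# 'formatted' is a hypothetical name used only for this check.
formatted = u'%s' % s

print(len(s))          # number of characters in the decoded string
print(formatted == s)  # whether formatting changed anything
```

The length check follows the suggestion in the comments on the question: five characters means the escapes were interpreted, thirty would mean they are still literal backslash sequences.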

Case 2: You have an escaped string from some other source that you want to evaluate as a Unicode string. This looks like what you're trying to do with print u'%s' %. It can be done with

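import ast
# Raw string: the backslashes are literal characters, not escapes.
s = r"\u063a\u064a\u0646\u064a\u0627"
# Wrap the text in a u'...' literal and evaluate it as a literal.
print(ast.literal_eval("u'{}'".format(s)))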

Note that, unlike eval, this is safe, because literal_eval only evaluates literals and rejects anything like a function call. Also note that s here is an r-prefixed (raw) string, so the backslashes aren't escaping anything; they are literally backslash characters.

Both pieces of code correctly output

غينيا
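To illustrate the safety point about literal_eval, here is a rough sketch rather than a definitive test. The hostile string is a made-up example of untrusted input, and the names escaped, decoded, hostile and rejected are hypothetical.

```python
import ast

# Escaped text as it might arrive from another source: literal backslashes.
escaped = r"\u063a\u064a\u0646\u064a\u0627"
decoded = ast.literal_eval("u'{}'".format(escaped))

print(len(escaped))  # length of the raw, escaped text
print(len(decoded))  # length after evaluating it as a Unicode literal

# Hypothetical hostile input: a function call instead of a literal.
hostile = "__import__('os').getcwd()"
try:
    ast.literal_eval(hostile)
    rejected = False
except (ValueError, SyntaxError):
    # literal_eval refuses anything that is not a plain literal.
    rejected = True

print(rejected)
```

One caveat with the u'{}' wrapper: if the incoming text itself contains a single quote, the constructed literal is no longer valid and literal_eval raises an error, so input like that would need separate handling.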

Some elaboration on print u'%s' % s for case 2. It behaves differently there, because if the backslashes in the string are literal characters, the string won't be evaluated like a Unicode literal during formatting. Python only builds Unicode characters out of \uXXXX escapes when a Unicode literal (such as the one assigned to s in case 1) is first evaluated. Once the escapes are literal text, normal string operations won't interpret them, so you have to use literal_eval to evaluate the text again in order to print the string properly. With the s from case 2, when you run

print u'%s' % s

the output is

\u063a\u064a\u0646\u064a\u0627

Note that this isn't a representation of a Unicode object but literally an ASCII string made up of backslashes, letters and digits.
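A quick way to confirm that the escaped value really is plain ASCII, sketched under the assumption that it matches the string in the question. The last step uses the standard unicode_escape codec, which is an alternative to the literal_eval approach above and not something this answer relies on; the names is_ascii and decoded are hypothetical.

```python
# The escaped form: thirty ordinary characters, backslashes included.
escaped = r"\u063a\u064a\u0646\u064a\u0627"

# Every character is in the ASCII range, so no Arabic letters are present yet.
is_ascii = all(ord(ch) < 128 for ch in escaped)
print(is_ascii)
print(len(escaped))

# Alternative decoding via the unicode_escape codec (works on Python 2 and 3
# for ASCII-only input like this one).
decoded = escaped.encode("ascii").decode("unicode_escape")
print(len(decoded))
```

The codec route avoids building a literal by hand, but it interprets every backslash escape in the input, so it is only a good fit when the whole string is known to be in this escaped form.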

Izaak van Dongen