Python3 - Convert unicode literals string to unicode string

Question

From the command-line parameters (sys.argv) I receive a string of unicode literals like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

For example this script uni.py:

import sys
print(sys.argv[1])

command line:

python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021

output:

\u041f\u0440\u0438\u0432\u0435\u0442\u0021

I want to convert it to the unicode string 'Привет!'

Please clarify what you want to do. `'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'` *is* the string `'Привет!'`. — MisterMiyagi, Mar 15 '20 at 12:40
To clarify the above: that representation is Python's representation *only*, because some terminals cannot print Unicode. Do this simple experiment: print out the ordinal value of the first character. You will see it is `1055` (`0x41f` in hexadecimal), and not `92`, the value for a backslash (nor `39` – the single quote – because that is *also* not "part of the string", even though it gets printed by Python as well). — Jongware, Mar 15 '20 at 13:51

wovano · Accepted Answer · 2020-03-15T15:26:10.480

You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.

If you just print the string, you'll get the correct result, assuming your terminal supports the characters.

print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')

This will print:

Привет!
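To see why a literal in source code behaves differently from the same characters typed on the command line, here is a small experiment along the lines suggested in the comments. The raw string is only a stand-in for what sys.argv[1] would contain; the variable names are made up for illustration.

```python
# In source code, Python resolves each \uXXXX escape while parsing the literal.
literal = '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

# A raw string keeps every backslash, like text arriving through sys.argv.
raw = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

print(len(literal), ord(literal[0]))  # 7 1055
print(len(raw), ord(raw[0]))          # 42 92
print(literal == raw)                 # False
```

The first string holds 7 characters starting with code point 1055, while the second holds 42 characters starting with a backslash (92).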

UPDATE

After you updated your question, it became clear to me that the mentioned string is not really a string literal (or unicode literal), but input from the command line. In that case you could use the "unicode-escape" encoding to get the result you want. Note that encoding works from Unicode to bytes, and decoding works from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which maps each code point from 0 to 255 to the byte with the same value. This step raises a UnicodeEncodeError if the input contains characters outside that range.

The following code will print the correct result for your example:

import sys

# str -> bytes via latin-1, then bytes -> str while resolving the \uXXXX escapes
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)
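The two steps can also be looked at separately. This is only a sketch: the raw string stands in for sys.argv[1], and the intermediate name as_bytes is introduced here for illustration.

```python
raw = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'  # stand-in for sys.argv[1]

# Step 1: str -> bytes. latin-1 produces exactly one byte per character.
as_bytes = raw.encode('latin-1')
print(type(as_bytes).__name__, len(as_bytes))  # bytes 42

# Step 2: bytes -> str. unicode-escape resolves each \uXXXX sequence.
text = as_bytes.decode('unicode-escape')
print(len(text), [hex(ord(c)) for c in text])
# 7 ['0x41f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x21']
```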

UPDATE 2

Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to solve that:

import ast
text = ast.literal_eval("'" + sys.argv[1] + "'")

But note that this would break if the input string contains a quote. I think it's a bit of a hack, since the method is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, which solution is best depends on what you're building.
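As a rough comparison, the sketch below runs both approaches on two awkward inputs: one containing a single quote, and one that mixes real Cyrillic characters with an escape sequence. The helper names (via_unicode_escape, via_literal_eval, attempt) are hypothetical and exist only for this illustration.

```python
import ast

def via_unicode_escape(raw: str) -> str:
    # Fails if raw holds any character above U+00FF (not encodable as latin-1).
    return raw.encode('latin-1').decode('unicode-escape')

def via_literal_eval(raw: str) -> str:
    # Fails if raw holds an unescaped single quote.
    return ast.literal_eval("'" + raw + "'")

def attempt(func, raw):
    """Return the converted string, or the name of the exception raised."""
    try:
        return func(raw)
    except (SyntaxError, ValueError) as exc:  # UnicodeEncodeError is a ValueError
        return type(exc).__name__

with_quote = r"it's \u0021"
with_cyrillic = 'Привет' + r'\u0021'

print(attempt(via_unicode_escape, with_quote))   # it's !
print(attempt(via_literal_eval, with_quote))     # SyntaxError
print(attempt(via_unicode_escape, with_cyrillic))            # UnicodeEncodeError
print(attempt(via_literal_eval, with_cyrillic) == 'Привет!')  # True
```

Each approach handles an input that the other rejects, so the choice seems to depend on what kind of input you expect.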
