Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Question

Let me introduce the problem first.

I received some data via POST/GET requests. The data was a UTF-8 encoded string, but I didn't realize that and converted it with a plain str() call. Now I have a database full of "nonsense data" and can't find a way to convert it back.

Example code:

unicode_str - this is the string I should obtain

encoded_str - this is the string I got with POST/GET requests - initial data

bad_str - the data currently in the database, from which I need to recover the Unicode text.

So I know how to convert in one direction: unicode_str =(encode)=> encoded_str =(str)=> bad_str

But I couldn't come up with a way back: bad_str =(???)=> encoded_str =(decode)=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???

Reti43 · Accepted Answer · 2017-11-16T18:39:20.493

14

You converted a bytes object into a string with str(), which gives you only the textual representation (repr) of the bytes object. You can recover the original bytes object with ast.literal_eval() (credit to Mark Tolonen for the suggestion), and then a simple decode() will do the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you generated the strings yourself, eval() would also be safe here, but ast.literal_eval() only evaluates literals, so there is no reason not to take the safer option.
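To make the root cause concrete, here is a minimal sketch of what went wrong and what would have worked at ingestion time. The variable names are only illustrative; the point is that str() without an encoding argument returns the repr of the bytes object, while decode() or str() with an encoding returns the text.

```python
import ast

text = 'Příliš žluťoučký kůň'
encoded = text.encode('utf-8')  # bytes, as they arrived with the request

# str() with no encoding argument returns the repr of the bytes object.
bad = str(encoded)
assert bad == repr(encoded) and bad.startswith("b'")

# Either of these would have produced the intended text in the first place.
assert encoded.decode('utf-8') == text
assert str(encoded, 'utf-8') == text

# Repairing an already stored value: parse the literal back into bytes, then decode.
recovered = ast.literal_eval(bad).decode('utf-8')
assert recovered == text
```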

edited Nov 16 '17 at 18:39

answered Nov 16 '17 at 12:31


Well, I also had `eval` in mind, but since I don't know what data is in there and there is a lot of it, I was hoping to avoid it, which is why I didn't mention it. But thanks :) – darkless Nov 16 '17 at 12:49

@darkless It doesn't matter what the strings you saved look like. As long as you followed the procedure of get UTF-8 string -> encode it to a bytes object -> turn **this** into a string and store it in your database, those strings are guaranteed to be representations of harmless bytes objects. – Reti43 Nov 16 '17 at 12:51
True, I didn't realize that every stored string is "b'...'", which eval should interpret as b'...' :) Thanks for the remark! – darkless Nov 16 '17 at 14:17

`ast.literal_eval` does the same thing without the security risk of `eval`. – Mark Tolonen Nov 16 '17 at 18:19
@MarkTolonen This is excellent. I have updated the answer accordingly. Also, this function sounds a bit familiar, and I wouldn't be surprised if I had "learnt" about it before but forgot, since I have never needed it. Need better memory for next time! – Reti43 Nov 16 '17 at 18:40

@darkless I just want to draw your attention to the fact that there is a safer version of `eval()` which also fits your requirements. – Reti43 Nov 16 '17 at 18:46
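For a large database where not every row is known to be damaged, a defensive variant may be useful. The following is only a sketch under assumptions: `repair_value` and the sample `rows` list are hypothetical names, and the policy of returning unparseable values unchanged is one possible choice, not something prescribed in this thread.

```python
import ast


def repair_value(stored: str) -> str:
    """Return decoded text if `stored` is the repr of a UTF-8 bytes object.

    Values that are not a bytes literal, or whose bytes are not valid UTF-8,
    are returned unchanged.
    """
    if not stored.startswith(("b'", 'b"')):
        return stored
    try:
        parsed = ast.literal_eval(stored)
    except (ValueError, SyntaxError):
        return stored  # looked like a bytes repr but is not a valid literal
    if not isinstance(parsed, bytes):
        return stored
    try:
        return parsed.decode('utf-8')
    except UnicodeDecodeError:
        return stored


rows = [
    str('žluťoučký kůň'.encode('utf-8')),  # damaged row
    'plain text',                          # row that was never damaged
    "b'unterminated",                      # similar prefix, but not a literal
]
repaired = [repair_value(row) for row in rows]
```

Note that a row whose genuine text happens to look exactly like a bytes literal would also be converted, so it may be worth reviewing a sample before writing anything back.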

score 6 · Answer 2 · answered Nov 17 '17 at 14:03


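Please do not use eval. Instead:

import codecs

s = 'žluťoučký'
x = str(s.encode('utf-8'))  # the damaged repr string

# strip the leading b' and the trailing quote
x = x[2:-1]

# turn the \xNN escape sequences back into raw bytes, then decode them as UTF-8
x = codecs.escape_decode(x)[0].decode('utf-8')

# profit
assert x == s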


Honza Král



Thanks for the non-eval version; I was missing `escape_decode` to turn the double backslashes into single ones. I can't find docs for the method, though: https://docs.python.org/3.5/library/codecs.html – darkless Nov 20 '17 at 20:33
Wow this is very nice to use. – Franco Gil Mar 17 '20 at 14:12
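Since `codecs.escape_decode` does not appear in the codecs documentation, a variant built only on the documented `unicode_escape` and `latin-1` codecs may be preferable. This is a sketch, not part of the answer above; the function name `unescape_bytes_repr` is made up for illustration, and it assumes the stored value really is the repr of a bytes object.

```python
def unescape_bytes_repr(stored: str) -> str:
    """Turn the repr of a UTF-8 bytes object back into text."""
    # Drop the leading b and opening quote, and the closing quote.
    inner = stored[2:-1]
    # The repr of bytes is pure ASCII; unicode_escape maps each \xNN escape
    # to the code point U+00NN and also resolves escaped quotes and backslashes.
    as_code_points = inner.encode('ascii').decode('unicode_escape')
    # latin-1 maps code points 0-255 one-to-one onto byte values,
    # which restores the original raw bytes.
    raw = as_code_points.encode('latin-1')
    return raw.decode('utf-8')


s = 'žluťoučký'
assert unescape_bytes_repr(str(s.encode('utf-8'))) == s
```

The round trip through `latin-1` is what makes this work: it is the one codec whose byte values equal the code points produced by `unicode_escape` for `\xNN` sequences.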
