
Let me introduce the problem first.

I received some data via POST/GET requests. The data was a UTF-8 encoded byte string, but I didn't know that at the time and converted it with a plain str() call. Now I have a database full of "nonsense data" and can't find a way to get the original text back.

The names used in the example below:

unicode_str - the string I should end up with

encoded_str - the data I received from the POST/GET requests (the initial data)

bad_str - the data currently stored in the database, which I need to turn back into unicode

So I know how to convert in one direction: unicode_str =(encode)=> encoded_str =(str)=> bad_str

But I couldn't come up with a way back: bad_str =(???)=> encoded_str =(decode)=> unicode_str

Example code:

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???
Donald Duck
darkless

2 Answers


You converted a bytes object to a string with str(), and the result is just the printable representation (the repr) of that bytes object. You can get the original bytes object back with ast.literal_eval() (credit to Mark Tolonen for the suggestion), and then a simple decode() will do the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you were the one who generated the strings, using eval() would be safe, but ast.literal_eval() only evaluates literals, so why not be safer?
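
As a rough sketch of how the recovery above might be wrapped for use on many stored values, the helper below parses the text with ast.literal_eval() and rejects anything that does not evaluate to a bytes object before decoding. The function name recover_text is made up for illustration, and the sketch assumes every stored value was produced by calling str() on UTF-8 encoded bytes, as described in the question.

```python
import ast


def recover_text(bad_str: str, encoding: str = "utf-8") -> str:
    """Turn the str() of a bytes object back into the original text.

    Hypothetical helper: assumes bad_str looks like "b'...'" and that the
    underlying bytes were encoded with the given encoding.
    """
    # literal_eval only accepts Python literals, so no code can be executed.
    value = ast.literal_eval(bad_str)
    if not isinstance(value, bytes):
        raise ValueError(f"expected a bytes literal, got {type(value).__name__}")
    return value.decode(encoding)


original = "Příliš žluťoučký kůň úpěl ďábelské ódy"
bad_str = str(original.encode("utf-8"))
print(recover_text(bad_str) == original)
```

The isinstance check is there so that a value which was never mangled (for example a plain quoted string) raises an error instead of being silently decoded.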

Reti43
  • Well, I had `eval` in mind too, but since I don't know what data is in there and there is a lot of it, I was hoping to avoid it, so I didn't mention it. But thanks :) – darkless Nov 16 '17 at 12:49
  • @darkless It doesn't matter what the strings you saved look like. As long as you followed the procedure of get UTF-8 string -> encode it to a bytes object -> turn **this** into a string and store it in your database, those strings are guaranteed to be harmless bytes objects. – Reti43 Nov 16 '17 at 12:51
  • True, I didn't realize that every stored string is "b'...'", which eval should interpret as b'...' :) Thanks for the remark! – darkless Nov 16 '17 at 14:17
  • `ast.literal_eval` does the same thing without the security risk of `eval`. – Mark Tolonen Nov 16 '17 at 18:19
  • @MarkTolonen This is excellent. I have updated the answer accordingly. Also, this function sounds a bit familiar, and I wouldn't be surprised if I had "learnt" about it before but forgot, since I have never needed it. Need better memory for next time! – Reti43 Nov 16 '17 at 18:40
  • @darkless I just want to draw your attention to the fact that there is a safer version of `eval()` which also fits your requirements. – Reti43 Nov 16 '17 at 18:46

Please do not use eval. Instead, strip the b'...' wrapper and unescape what is left:

import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))

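# strip the leading b' and the trailing '
x = x[2:-1]

# unescape the \xNN sequences back into bytes (escape_decode returns a
# (bytes, length) tuple, hence the [0]), then decode the bytes as UTF-8
x = codecs.escape_decode(x)[0].decode('utf-8')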

# profit
x == s
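
For anyone uneasy about relying on codecs.escape_decode, which does not appear in the codecs documentation, the same unescaping can probably be done with documented codecs only. The sketch below is an assumption-laden variant, not part of the answer above: the function name is hypothetical, and it assumes the stored value is exactly the str() of a bytes object holding UTF-8 data.

```python
def recover_text_documented(bad_str: str) -> str:
    """Undo str(bytes) using only the documented unicode_escape codec.

    Hypothetical helper: assumes bad_str looks like "b'...'" or 'b"..."'.
    """
    if len(bad_str) < 3 or bad_str[0] != "b" or bad_str[1] not in "'\"":
        raise ValueError("not the str() of a bytes object")
    inner = bad_str[2:-1]  # drop the b' prefix and the closing quote

    # The repr of bytes is pure ASCII, so encoding to ASCII is lossless.
    # unicode_escape turns each \xNN into the character with code point NN,
    # and latin-1 maps code points 0-255 straight back to single bytes.
    raw = inner.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode("utf-8")


s = "žluťoučký"
print(recover_text_documented(str(s.encode("utf-8"))) == s)
```

The round trip through latin-1 is what makes this work: it is the one codec where every byte value corresponds to the code point with the same number.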
Honza Král
  • Thanks for the non-eval version. I was missing `escape_decode` to turn the double backslashes into single ones. I can't find docs for the method though: https://docs.python.org/3.5/library/codecs.html – darkless Nov 20 '17 at 20:33
  • Wow this is very nice to use. – Franco Gil Mar 17 '20 at 14:12