Decoding UTF-8 URL in Python

Question

I have a string like "pe%20to%C5%A3i%20mai". When I apply urllib.parse.unquote to it, I get "pe to\u0163i mai". If I try to write this to a file, I get those exact symbols, not the expected glyph.

How can I transform the string to utf-8 so in the file I have the proper glyph instead?

Edit: I'm using Python 3.2

Edit2: So I figured out that the urllib.parse.unquote was working correctly, and my problem actually is that I'm serializing to YAML with yaml.dump and that seems to screw things up. Why?

You're doing many things here. You're 1) writing a string literal in your program, 2) decoding it using a library function, 3) writing it to a file, 4) reading it back from the file and 5) printing it. In every single one of those steps there are character encoding issues. If any one of the steps contains an error the final result will be wrong. Instead of doing 5 things at once, split your problem up into five smaller problems. Test that the correct thing happens for every step. Check the intermediate results. Determine which of the five steps is the one that doesn't work. And post your code. — Mark Byers, Aug 13 '12 at 18:14

jfs · Accepted Answer · 2012-08-13T19:04:55.867

Update: If the output file is a yaml document then you could ignore \u0163 in it. Unicode escapes are valid in yaml documents.

#!/usr/bin/env python3
import json

# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"

Note: no \u in the last case. Both lines represent the same Python string.
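One way to check this is to parse both outputs back. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original answer.

```python
import json

text = 'pe toţi mai'
escaped = json.dumps(text)                       # ASCII-only output, uses \u0163
readable = json.dumps(text, ensure_ascii=False)  # keeps ţ as-is

# The two serialized forms differ as text...
print(escaped != readable)
# ...yet both parse back to the same Python string.
print(json.loads(escaped) == json.loads(readable) == text)
```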

yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.
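A rough sketch of the difference, assuming the third-party PyYAML package is installed (the import is guarded so the snippet still runs without it). The variable names are illustrative.

```python
try:
    import yaml  # third-party PyYAML package, assumed to be installed
except ImportError:
    yaml = None

text = 'pe toţi mai'

if yaml is not None:
    escaped = yaml.dump(text)                       # default: non-ASCII is escaped
    readable = yaml.dump(text, allow_unicode=True)  # ţ is written as-is
    print(escaped)
    print(readable)
    # Either form loads back to the original string.
    assert yaml.safe_load(escaped) == text
    assert yaml.safe_load(readable) == text
```

When dumping straight to a file, the same keyword can be passed along with the stream, for example yaml.dump(data, file, allow_unicode=True) with the file opened using encoding='utf-8'.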

The url is correct. You don't need to do anything with it:

#!/usr/bin/env python3
from urllib.parse import unquote

url =  "pe%20to%C5%A3i%20mai"
text = unquote(url)

with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file

    p(text)                # -> pe toţi mai
    p(repr(text))          # -> 'pe toţi mai'
    p(ascii(text))         # -> 'pe to\u0163i mai'

    p("pe to\u0163i mai")  # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    #NOTE: r'' prefix

The \u0163 sequence might be introduced by a character encoding error handler:

with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai

Or:

with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai

More examples:

# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ
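A caveat worth keeping in mind: the unicode-escape codec treats every non-escaped byte as Latin-1, so it is only safe on pure ASCII input such as the bytes above. The sketch below (variable names are illustrative) shows what happens if it is applied to UTF-8 bytes instead.

```python
text = 'pe toţi mai'

# Safe: the input bytes are pure ASCII, ţ is present only as \u0163.
ascii_bytes = text.encode('ascii', 'backslashreplace')
assert ascii_bytes.decode('unicode-escape') == text

# Unsafe: the two UTF-8 bytes of ţ (C5 A3) are read as two Latin-1 characters.
utf8_bytes = text.encode('utf-8')
mangled = utf8_bytes.decode('unicode-escape')
print(ascii(mangled))  # 'pe to\xc5\xa3i mai'
```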

Mark Byers · Answer 2 · 2012-08-13T18:08:16.257

Python 3

Calling urllib.parse.unquote returns a Unicode string already:

>>> urllib.parse.unquote("pe%20to%C5%A3i%20mai")
'pe toţi mai'

If you don't get that result, it must be an error in your code. Please post your code.
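To see what unquote does internally, the two steps can be separated with unquote_to_bytes from the same module. This is a small illustrative sketch, not part of the original answer.

```python
from urllib.parse import unquote, unquote_to_bytes

url = "pe%20to%C5%A3i%20mai"

raw = unquote_to_bytes(url)   # percent-escapes replaced by raw bytes
print(raw)                    # b'pe to\xc5\xa3i mai'

text = raw.decode('utf-8')    # the two bytes C5 A3 decode to one character
print(ascii(text))            # 'pe to\u0163i mai'

# unquote() performs both steps; utf-8 is its default encoding.
assert unquote(url) == text
```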

Python 2

Use decode to get a Unicode string from a bytestring:

>>> import urllib2
>>> print urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
pe toţi mai

Remember that when you write a Unicode string to a file you have to encode it again. You could choose to write to the file as UTF-8, but you could also choose a different encoding if you wished. You also have to remember to use the same encoding when reading back from the file. You may find the codecs module useful for specifying an encoding when reading from and writing to files.

>>> import urllib2, codecs
>>> s = urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')

>>> # Write the string to a file.
>>> with codecs.open('test.txt', 'w', 'utf-8') as f:
...     f.write(s)

>>> # Read the string back from the file.
>>> with codecs.open('test.txt', 'r', 'utf-8') as f:
...     s2 = f.read()
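For comparison, here is a sketch of the same round trip in Python 3, where the built-in open() accepts an encoding argument and codecs is not needed. The temporary file path is hypothetical and chosen only for this example.

```python
import os
import tempfile
from urllib.parse import unquote

text = unquote("pe%20to%C5%A3i%20mai")

# Hypothetical file location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Write the string, encoding it as UTF-8.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back with the same encoding.
with open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Inspect the bytes that actually ended up on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(read_back == text)
print(raw)
```

Checking the raw bytes separately helps to tell a writing problem apart from a display problem.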

One potentially confusing issue is that in the interactive interpreter Unicode strings are sometimes displayed using the \uxxxx notation instead of the actual characters:

>>> s
u'pe to\u0163i mai'
>>> print s
pe toţi mai

This does not mean that the string is "wrong". It's just the way the interpreter works.

[Python 3 works differently](http://stackoverflow.com/a/11939582/4279) e.g., `repr()` is not escaped — jfs, Aug 13 '12 at 18:08

score 1 · Answer 3 · answered Aug 13 '12 at 17:32

Try decode using unicode_escape.

E.g.:

>>> print "pe to\u0163i mai".decode('unicode_escape')
pe toţi mai

Maria Zverina

It says `AttributeError: 'str' object has no attribute 'decode'` – rolisz Aug 13 '12 at 17:54

score 1 · Answer 4 · answered Aug 13 '12 at 18:39

urllib.parse.unquote returned the correct string, and writing that straight to the file produced the expected result. The problem was with yaml: by default yaml.dump escapes non-ASCII characters instead of writing them out as UTF-8.

My solution was to do:

yaml.dump("pe%20to%C5%A3i%20mai",encoding="utf-8").decode("unicode-escape")

Thanks to J.F. Sebastian and Mark Byers for asking me the right questions that helped me figure out the problem!

rolisz

I will use it. And I will read more carefully the yaml documentation next time :)) – rolisz Aug 13 '12 at 18:58
