How to convert percent-encoded url to string with non-ascii chars?

Question

This should be an easy one I hope. I have a url:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg

that is saved into a json file with this code:

paintings = get_all_paintings(marc_chagall)
with open('chagall.json', 'w') as fb:
    json.dump(paintings, fb)  # json.dump returns None, so there is nothing to assign

In the file, the URL has become:

u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

I am able to get the original, usable, percent-encoded URL with this code:

p = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
p = urllib.quote(p.encode('utf8'), safe='/:')
print repr(p) 
> 'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg'

Now comes the tricky part. I want to get this string:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

with the non-ascii character in napoléon intact. This is for naming purposes in the storage bucket, not for anything else. How can I produce this string?

score 4 · Accepted Answer · answered Nov 11 '14 at 12:17

Just print the unicode value:

>>> print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

Don't confuse the Python representation of the Unicode value (which deliberately uses escapes for non-ASCII characters for ease of debugging and introspection) with the actual value.

Printing encodes the value with the codec used by your console or terminal, provided Python was able to detect it. My terminal is set to UTF-8, so Python encoded the Unicode code point U+00E9 to the bytes C3 A9, and my terminal then interpreted those bytes as UTF-8 and displayed the é.

This all just means that you already have the right value, but were thrown by the debugging output.
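
To make the distinction concrete, here is a small sketch comparing the debugging representation with the actual value, and showing the UTF-8 bytes mentioned above. The variable names are illustrative and not from the question; the snippet should run unchanged on Python 2.7 and on Python 3.3+.

```python
# -*- coding: utf-8 -*-
url = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# The escape \xe9 in the literal stands for one code point, U+00E9,
# not for the four characters backslash, x, e, 9.
e_acute = url[url.index(u'napol') + len(u'napol')]
is_single_char = (len(e_acute) == 1 and ord(e_acute) == 0xE9)

# repr() is the debugging view of the same value; whether it shows
# escapes depends on the Python version, but the value is identical.
debug_view = repr(url)

# Encoding to UTF-8 yields the two bytes C3 A9 for that one code point.
utf8_bytes = e_acute.encode('utf-8')
```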

Martijn Pieters

1,048,767
296
4,058
3,343

I want to save the last part to a variable, like `x.split('/')[-1]` – ian-campbell Nov 11 '14 at 12:18
@edmund_spenser: then just do so. Unicode strings support splitting just like byte strings do. – Martijn Pieters Nov 11 '14 at 12:20
I was really thrown by, like you said, the python representation of the Unicode value. I didn't realize what I had. – ian-campbell Nov 11 '14 at 15:10
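
As a rough sketch of the splitting discussed in these comments, the last path segment can be taken directly from the Unicode string. The names `filename` and `filename_utf8` are assumptions for illustration; whether the storage bucket wants text or bytes depends on its API, which the question does not specify.

```python
# -*- coding: utf-8 -*-
url = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# Unicode strings split exactly like byte strings do.
filename = url.split(u'/')[-1]

# If the consumer needs bytes rather than text, encode explicitly
# instead of relying on an implicit conversion.
filename_utf8 = filename.encode('utf-8')
```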

score 1 · Answer 2 · answered Nov 11 '14 at 12:17

You already have it:

print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

The value of p is already that string; it is only displayed differently.
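
For the case in the title, where only the percent-encoded form is at hand, a minimal sketch of the round trip follows. It assumes Python 3, where the question's `urllib.quote` lives in `urllib.parse`; the variable names `quoted` and `decoded` are illustrative.

```python
from urllib.parse import quote, unquote

p = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# Percent-encode the UTF-8 bytes of the text, leaving '/' and ':' alone.
quoted = quote(p, safe='/:')

# unquote() reverses this and decodes the bytes as UTF-8 by default.
# On Python 2, the rough equivalent for a byte-string input would be
# urllib.unquote(quoted).decode('utf8').
decoded = unquote(quoted)
```

If `decoded == p` holds, the percent-encoded URL and the Unicode string carry the same information, and either one can be derived from the other.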

Simeon Visser

118,920
18
185
180

That prints it to the console, but how do I save it to a variable and store it? – ian-campbell Nov 11 '14 at 12:18
@edmund_spenser: the variable `p` already contains the string you want (exactly), it's only displayed differently (the sequence `\xe9` is the character you want). – Simeon Visser Nov 11 '14 at 12:19