Python: URL encode urls with latin characters

Question

I have many entities in a database with a "url" attribute. In many records the URL is stored unencoded, i.e. it contains Latin characters, which doesn't work in Firefox (the URLs point to song files stored in S3 and I play them with SoundManager 2).

Example:

url with latin character "ó": https://something.s3.amazonaws.com/music/something/thisó.mp3

If I replace "ó" with its percent-encoded UTF-8 form "%c3%b3", then https://something.s3.amazonaws.com/music/something/this%c3%b3.mp3 works.

I would like to replace all Latin and special characters with their percent-encoded UTF-8 codes, based on this chart.

I tried opening https://something.s3.amazonaws.com/music/something/this%c3%b3.mp3 in Firefox and it converted it to https://something.s3.amazonaws.com/music/something/thisó.mp3 and then showed an AccessDenied page in XML. Based on this, it looks like replacing Latin special characters with UTF-8 codes does not solve the problem. — , Sep 03 '15 at 17:15
@TrisNefzger yes, the browser replaces it correctly, but with SoundManager (under Firefox) it doesn't play. I found a solution with urllib.quote; if you want I can share it. Thank you :-) — Yahya Yahyaoui, Sep 03 '15 at 17:50
Glad you found a solution. From what you wrote, I guess it is url = urllib.quote(url). On Python 2 that resulted in 'https%3A//something.s3.amazonaws.com/music/something/this%C3%B3.mp3'. There is no urllib.quote() in Python 3, but urllib.request.quote(url) gives the same result. Please share your solution in an Answer for all. Thanks. — , Sep 03 '15 at 18:49
@Survivor: If you found a solution post it as your own answer, wait some time and accept your own answer in order to close your question. — albert, Sep 03 '15 at 19:08

Yahya Yahyaoui · Accepted Answer · 2015-09-04T01:18:22.700

As asked by @albert, I'm posting the solution I found. With urllib's quote function you can encode Latin characters as well as characters like " ", "(" and all other special characters. Since quote also converts "http:" to "http%3A", which is not wanted, the URL has to be split so that only the relevant part is converted.

Another thing to consider is whether the URLs are already partially or completely encoded. In that case the URL contains percent-encoded characters, which include "%". quote treats "%" as a special character and converts it to "%25", which mangles the URL beyond easy repair!
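To make the first point concrete, here is a minimal sketch in Python 3, where quote lives in urllib.parse instead of urllib. The helper name quote_path_only is made up for this illustration, and it assumes the path is not percent-encoded yet.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_path_only(url):
    """Percent-encode only the path of url (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() keeps "/" by default, so the path separators survive.
    return urlunsplit(parts._replace(path=quote(parts.path)))


url = "http://something/cóntaining space song name.mp3"

# Quoting the whole URL also encodes the ":" after the scheme.
whole = quote(url)

# Quoting only the path leaves "http://something" untouched.
path_only = quote_path_only(url)

print(whole)
print(path_only)
```

The first line printed starts with "http%3A//", the second one keeps "http://" and only encodes "ó" and the spaces.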

Example of the case:

Suppose the original url is url = "http://something/cóntaining space song name.mp3"

If the url is already partially encoded (e.g. " " stored as "%20"), then the current url may look like this:

url = "http://something/cóntaining%20space%20song%20name.mp3"

Then urllib.quote(url) gives (let's assume that "http:" is not converted to "http%3A"):

"http://something/c%C3%B3ntaining%2520space%2520song%2520name.mp3"

The result is a mess!
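The double encoding can be reproduced with a few lines. This is a Python 3 sketch (urllib.parse.quote), applied only to the path part of the example. The second call passes safe="/%", which is one possible way to leave existing "%" sequences alone. It is not the approach used in this answer, and it assumes every "%" in the URL really is the start of an escape sequence.

```python
from urllib.parse import quote

# Path part of the partially encoded example URL.
path = "/cóntaining%20space%20song%20name.mp3"

# Default behaviour: "%" is not a safe character, so "%20" becomes "%2520".
double_encoded = quote(path)

# Declaring "%" as safe leaves the existing escapes as they are.
kept = quote(path, safe="/%")

print(double_encoded)
print(kept)
```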

That said, we can't simply split the url into "http:" and the rest of it and then apply quote to the second part.

So the solution is to encode these special characters one by one: replace each Latin or special character with its percent-encoded UTF-8 form. Then comes the question: how?

It is painful to check whether each url contains any character from a list of such characters (another thing: if the url is unicode you can't use url.find("ó")). Here comes the trick: the problem is the solution!

How do we find the Latin and special characters? With the exception!

If the urls (containing bad characters) are of type "unicode", converting them to str raises an exception.

If the urls (containing bad characters) are of type "str", converting them to unicode raises an exception.

We find the wanted characters with the exception ;-)

Then split the url at the position of that character, quote the character, and at the end rebuild the url.
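As a rough illustration of the "find it with the exception" idea, here is a small Python 3 sketch. In Python 3 every str is unicode, so the failing conversion is url.encode("ascii") rather than str(url). The function name first_non_ascii is made up for this example.

```python
def first_non_ascii(url):
    """Return (position, character) of the first non-ASCII character, or None."""
    try:
        url.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        return exc.start, url[exc.start]
    return None


url = "https://something.s3.amazonaws.com/music/something/thisó.mp3"
found = first_non_ascii(url)
print(found)
```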

In my case, the urls are unicode:

import urllib

from core.models import Song


songs = Song.objects.all()

for song in songs:
    try:
        # str() fails on a unicode url that contains bad (non-ASCII) characters
        str(song.song_url)
    except UnicodeEncodeError as exc:
        pos = exc.start  # position of the first bad character
        c = song.song_url[pos].encode("utf8")
        q = urllib.quote(c)
        p1 = song.song_url[:pos]  # part before the bad character
        p2 = song.song_url[pos + 1:]  # part after the bad character
        res = p1 + q + p2  # rebuilt url
        song.song_url = res
        song.save()
        print res

Note: if the url contains several "bad" characters, the above code only treats the first one in each url, so either make it recursive or run it several times until you get no output. I hope this helps.
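Instead of running the script several times, the same idea can be put in a loop. Below is a minimal Python 3 sketch under that assumption (the function name quote_non_ascii is made up here). It only touches non-ASCII characters, so existing "%20" sequences stay as they are, but ASCII characters such as " " or "(" are not encoded either.

```python
from urllib.parse import quote


def quote_non_ascii(url):
    """Percent-encode every non-ASCII character in url, leave the rest untouched."""
    while True:
        try:
            url.encode("ascii")
            return url  # nothing left to fix
        except UnicodeEncodeError as exc:
            # exc.start:exc.end covers the characters that failed to encode.
            bad = url[exc.start:exc.end]
            url = url[:exc.start] + quote(bad) + url[exc.end:]


fixed = quote_non_ascii("http://something/cóntaining%20spáce.mp3")
print(fixed)
```

Each pass replaces at least one bad character with pure ASCII, so the loop ends once the whole url encodes cleanly.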

Generic example where url is of type "str":

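# -*- coding: utf-8 -*-
import urllib

url = "https://something.s3.amazonaws.com/music/something/thisó.mp3"

try:
    # unicode() fails on a byte string that contains non-ASCII bytes
    unicode(url)
except UnicodeDecodeError as exc:
    # position of the first bad byte; everything before it is plain ASCII
    pos = exc.start
    url2 = url.decode('utf8')
    c = url2[pos].encode("utf8")
    q = urllib.quote(c)
    p1 = url2[:pos]  # part before the bad character
    p2 = url2[pos + 1:]  # part after the bad character
    res = p1 + q + p2  # rebuilt url
    print res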

I hope this solution is helpful to anyone who comes across this problem.
