Decode only the non-ASCII characters in a URL

Question

I'm currently working on Wikipedia. In many articles I've noticed URLs that are very long, for example https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99. This URL can be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I use the urllib.unquote function to decode the URL, it also decodes %26, and I get "https://www.google.com/search?q=&ฉัน" as the result. As you might have noticed, this URL is useless: the bare & is read as a query-parameter separator, so it is no longer a valid link to the same search.

Therefore, I want to know how to decode a link so that it stays valid. I think that decoding only the non-ASCII characters would give a valid URL. Is that correct? And how can I do it?

Thanks :)

score 1 · Accepted Answer · answered Dec 13 '12 at 14:46

The easiest way is to replace every percent-encoded sequence below %80 (%00-%7F) with a placeholder, URL-decode the string, and then replace each placeholder with its original percent-encoded sequence.
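A minimal sketch of the placeholder approach, assuming Python 3 (where urllib.unquote is available as urllib.parse.unquote), a UTF-8 encoded URL, and no raw NUL characters in the input. The function name and the NUL-delimited placeholder format are illustrative choices, not part of any library:

```python
import re
from urllib.parse import unquote

# Percent escapes for ASCII bytes (%00-%7F); these must stay encoded.
ASCII_ESCAPE = re.compile(r"%[0-7][0-9A-Fa-f]")

def unquote_non_ascii(url):
    """Decode only the non-ASCII percent escapes of a UTF-8 encoded URL."""
    saved = []

    def stash(match):
        # Remember the escape and leave an indexed placeholder behind.
        saved.append(match.group(0))
        return "\x00{}\x00".format(len(saved) - 1)

    protected = ASCII_ESCAPE.sub(stash, url)
    # errors="strict" raises UnicodeDecodeError if the bytes are not UTF-8.
    decoded = unquote(protected, encoding="utf-8", errors="strict")
    # Put the original ASCII escapes back in place of the placeholders.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], decoded)

print(unquote_non_ascii(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
# https://www.google.com/search?q=%26ฉัน
```

A running index is enough to tell the placeholders apart, so there is no need for a UUID per occurrence.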

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.

So, when encoded in URLs, each valid non-ASCII UTF-8 character follows one of these patterns:

(%C2-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F4)(%80-%BF)(%80-%BF)(%80-%BF)

The original UTF-8 design also allowed the lead bytes %C0, %C1 and %F5-%F7, as well as five- and six-byte forms starting with %F8-%FB and %FC-%FD, but RFC 3629 rules all of these out, so they never appear in valid UTF-8.

So you can match these patterns in the URL and unquote each character separately.
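The pattern matching could be sketched as below, assuming Python 3 and the RFC 3629 lead-byte ranges (%C2-%DF, %E0-%EF, %F0-%F4). The names UTF8_ESCAPED, decode_char and unquote_utf8_only are made up for this example. The regex is deliberately loose, so a strict decode acts as the final check and anything that is not valid UTF-8 is left encoded:

```python
import re
from urllib.parse import unquote_to_bytes

CONT = r"%[89ABab][0-9A-Fa-f]"  # continuation byte %80-%BF
UTF8_ESCAPED = re.compile(
    r"%(?:[Cc][2-9A-Fa-f]|[Dd][0-9A-Fa-f])" + CONT   # 2-byte character
    + r"|%[Ee][0-9A-Fa-f]" + CONT * 2                # 3-byte character
    + r"|%[Ff][0-4]" + CONT * 3                      # 4-byte character
)

def decode_char(match):
    raw = unquote_to_bytes(match.group(0))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Overlong forms, surrogates, etc.: keep the escapes untouched.
        return match.group(0)

def unquote_utf8_only(url):
    """Decode percent-encoded UTF-8 characters; leave everything else as is."""
    return UTF8_ESCAPED.sub(decode_char, url)

print(unquote_utf8_only(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
# https://www.google.com/search?q=%26ฉัน
```

Because ASCII escapes such as %26 never match the pattern, they are never touched and no placeholder step is needed.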

However, remember that not all URLs are encoded in UTF-8.

Some old websites still use other character sets, such as Windows-874 for Thai.

In such cases, "ฉัน" is encoded for that particular website as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote, you will get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
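The mismatch can be reproduced in a few lines, assuming Python 3 and its built-in cp874 codec for Windows-874 (the variable names are only for illustration):

```python
from urllib.parse import quote, unquote

word = "ฉัน"

utf8_form = quote(word, encoding="utf-8")    # "%E0%B8%89%E0%B8%B1%E0%B8%99"
cp874_form = quote(word, encoding="cp874")   # "%A9%D1%B9"

# unquote assumes UTF-8 by default and replaces undecodable bytes with
# U+FFFD, so %A9 becomes the replacement character and %D1%B9 happens
# to form the unrelated Cyrillic letter "ѹ".
garbled = unquote(cp874_form)

# Decoding with the encoding the site actually uses recovers the word.
recovered = unquote(cp874_form, encoding="cp874")

print(utf8_form, cp874_form, garbled, recovered)
```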

So you have to be careful and check whether decoding the URL breaks the link. Make sure that the URL you're decoding is in UTF-8.
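One possible guard, assuming Python 3, is to try a strict UTF-8 decode of the whole URL first and leave the URL alone if it fails. The helper name looks_like_utf8 is hypothetical, and this is only a heuristic: some byte sequences from legacy encodings also happen to be valid UTF-8.

```python
from urllib.parse import unquote_to_bytes

def looks_like_utf8(url):
    """Return True if the percent-decoded bytes of url are valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))  # True
print(looks_like_utf8("%A9%D1%B9"))  # False (Windows-874 bytes)
```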

Can you offer some more details on how to go about implementing your suggestions? E.g., what is a clean way to actually code ‘replace all URL encode sequence below %80 (%00-%7F) with some placeholder, do a URL decode, and replace the original URL encode sequence back into the placeholder’? The design that comes to my mind is to use a regex to find these occurrences, generate a UUID for each and replace them with this id, save the occurrence and its UUID in a dict, do the url decoding, then iterate on the dict and undo the replacements. This seems rather inefficient and ugly though ... — HappyFace, May 06 '21 at 14:11
