Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 親å%90é¤%90廳, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.

-
Encodings are listed here, you can try print " ) – monkut Sep 07 '12 at 11:07
2 Answers
It is called a mutt encoding: the underlying bytes have been mangled beyond their original meaning and no longer form any real encoding.
The text was once URL-quoted UTF-8, but it was then interpreted as Latin-1 without first unquoting the URL escapes. I was able to un-mangle it by reversing those steps:
>>> from urllib2 import unquote
>>> bytesquoted = u'å%8f°å%8d%97 親å%90é¤%90廳'.encode('latin1')  # recover the raw bytes
>>> unquoted = unquote(bytesquoted)  # undo the %xx URL escapes
>>> print unquoted.decode('utf8')  # the bytes were UTF-8 all along
台南 親子餐廳
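On Python 3 the same un-mangling can be sketched with urllib.parse. The string below is a hypothetical stand-in (the one in the question was damaged further in transit here), but the steps are the same: Latin-1 recovers the raw bytes, unquoting undoes the %xx escapes, and the result decodes cleanly as UTF-8.

```python
from urllib.parse import unquote_to_bytes

original = '台南'  # hypothetical search term

# Simulate the mangling: take the UTF-8 bytes, URL-quote some of
# them, and leave the rest raw so they get misread as Latin-1.
raw = original.encode('utf-8')
mangled = ''.join(chr(b) if i % 3 == 0 else '%%%02x' % b
                  for i, b in enumerate(raw))
print(mangled)  # å%8f%b0å%8d%97 -- the same flavour of garbage

# Un-mangle: Latin-1 encode -> unquote the %xx escapes -> decode UTF-8.
fixed = unquote_to_bytes(mangled.encode('latin-1')).decode('utf-8')
print(fixed)  # 台南
```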

Martijn Pieters
- 1,048,767
- 296
- 4,058
- 3,343
-
Just out of curiosity though, why is this particular mutt encoding used on the web statistics site? – boobami Sep 07 '12 at 12:50
-
@jih: usually due to inexperience with international character sets and encodings. – Martijn Pieters Sep 07 '12 at 12:51
-
You can use chardet. Install the library with:
pip install chardet
# or for python3
pip3 install chardet
The library includes a CLI utility chardetect (or chardetect3, accordingly) that takes the path to a file.
Once you know the encoding, you can use it in Python, for example like this:
import codecs
codecs.open('myfile.txt', 'r', 'GB2312')
or from shell:
iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
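As a minimal round-trip sketch (assuming Python 3; the file and its contents here are hypothetical): write a GB2312-encoded file, then reopen it with the codec chardet reported:

```python
import codecs
import tempfile

text = '你好'  # sample text that GB2312 can represent

# Stand-in for the mystery file: write it out in GB2312.
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt',
                                 encoding='GB2312', delete=False) as f:
    f.write(text)
    path = f.name

# Once chardet has told you the encoding, decode with it:
with codecs.open(path, 'r', 'GB2312') as f:
    decoded = f.read()

print(decoded)  # 你好
```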
If you need more performance, there is also cchardet, a faster C-optimized version of chardet.

ccpizza