
I got some Wikipedia URLs from a Freebase dump:

url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

They both refer to the same page on Wikipedia:

url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

`urllib.unquote` works on url 1 when applied twice:

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25 -> %
url = urllib.unquote(url)  # second pass: %C3%A3 -> the UTF-8 bytes of "ã"
print url

The result is:

Pedro_Miguel_de_Castro_Brandão_Costa

But it does not work on url 2:

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url

The result is:

Pedro_Miguel_de_Castro_Brand�o_Costa    

Is something wrong?

John Zwinck
icycandy

1 Answer


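The former is UTF-8 that has been percent-encoded twice (`%25` is itself an encoded `%`), so two passes of `unquote` yield the UTF-8 bytes `\xc3\xa3`, which print normally since your terminal uses UTF-8. The latter is percent-encoded Latin-1: `%E3` unquotes to the single byte `\xe3`, which is not valid UTF-8 on its own and must be decoded as Latin-1 first.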

>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
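
If both kinds of URL show up in the same dump, the two cases can be folded into one helper. The following is only a sketch, and it assumes Python 3, where `urllib.unquote` has moved to `urllib.parse` and `unquote_to_bytes` returns the raw bytes. The name `decode_wiki_title` is made up for illustration, and the "try UTF-8, then fall back to Latin-1" rule is a heuristic rather than anything Freebase or Wikipedia guarantees.

```python
from urllib.parse import unquote_to_bytes


def decode_wiki_title(quoted):
    """Best-effort decoding of a percent-encoded page title (hypothetical helper)."""
    raw = unquote_to_bytes(quoted)
    # A leftover '%' suggests the title was percent-encoded twice (url 1).
    # Assumption: real titles in the dump contain no literal '%'.
    if b'%' in raw:
        raw = unquote_to_bytes(raw)
    try:
        # url 1 case: the bytes are valid UTF-8.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # url 2 case: a lone byte such as \xe3 is not valid UTF-8,
        # so fall back to Latin-1, which accepts every byte value.
        return raw.decode('latin-1')
```

Under these assumptions, passing either `'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'` or `'Pedro_Miguel_de_Castro_Brand%E3o_Costa'` gives the same text string. The Latin-1 fallback never raises, so a title in some other legacy encoding would be decoded silently but wrongly.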
Ignacio Vazquez-Abrams
    I needed to add `encode('utf8')` to print it correctly, that is, `print '...'.decode('latin-1').encode('utf8')`. Many thanks for your quick help. – icycandy Dec 19 '14 at 07:19