I'm writing a Wikipedia web crawler in Python. I extract the language information from the pages, which contains characters from multiple languages, such as Chinese and Japanese. When I get the strings I want and print them out, they appear to be encoded in ASCII, so the result looks like this:
...
('Vietnamese', 'vi', 'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t')
{'confidence': 1.0, 'encoding': 'ascii'}
('Turkish', 'tr', 'T\xc3\xbcrk\xc3\xa7e')
{'confidence': 1.0, 'encoding': 'ascii'}
('Ukrainian', 'uk', '\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd1\x97\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb0')
{'confidence': 1.0, 'encoding': 'ascii'}
('Chinese', 'zh', '\xe4\xb8\xad\xe6\x96\x87')
{'confidence': 1.0, 'encoding': 'ascii'}
My code:
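import re

def getLanguageContent(content):
    # matchReg is the regex pattern, defined elsewhere in the crawler
    mainPattern = re.compile(matchReg)
    mainContentMatch = mainPattern.findall(content)
    return mainContentMatch

# getContentFromURL and sitePrefix are also defined elsewhere
arr = getLanguageContent(getContentFromURL(sitePrefix))
print arr
for a in arr:
    a = str(a)  # a is a tuple, so str(a) is its repr, with \x escapes
    print a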
arr is a list like [('Simple English', 'simple', 'Simple English'), ('Arabic', 'ar', '\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'), ....]
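To reproduce this without the crawler, here is a minimal sketch. It assumes the byte strings are UTF-8 encoded, which the escape sequences suggest but which I have not confirmed. `sample`, `decode_row`, `decoded` and `lines` are hypothetical names used only for illustration and are not part of my crawler.

```python
# Hypothetical sample mirroring the shape of arr: tuples of byte strings.
sample = [
    (b'Turkish', b'tr', b'T\xc3\xbcrk\xc3\xa7e'),
    (b'Chinese', b'zh', b'\xe4\xb8\xad\xe6\x96\x87'),
]

# The repr of a tuple escapes every non-ASCII byte, so the text produced by
# str(a) in the loop above contains only ASCII characters.
tuple_repr_is_ascii = all(ord(ch) < 128 for ch in repr(sample[0]))


def decode_row(row, encoding='utf-8'):
    """Decode every byte string in one tuple into text (UTF-8 is assumed)."""
    return tuple(field.decode(encoding) for field in row)


decoded = [decode_row(row) for row in sample]

# Join the fields of each tuple instead of converting the whole tuple with
# str(); the resulting lines hold the real characters, not \x escapes.
lines = [' '.join(row) for row in decoded]
```

Printing each entry of `lines` should then show the native names, provided the terminal's encoding can represent them.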
I want to know how I can deal with this problem and print the strings with the correct decoding. Thanks a lot.