I'm writing a web crawler for Wikipedia in Python. I extract the language information from the pages, which contains characters from multiple languages such as Chinese and Japanese. When I get the strings I want and print them out, they are encoded in ASCII, so the result looks like this:

...('Vietnamese', 'vi', 'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t') {'confidence': 1.0, 'encoding': 'ascii'}
('Turkish', 'tr', 'T\xc3\xbcrk\xc3\xa7e') {'confidence': 1.0, 'encoding': 'ascii'}
('Ukrainian', 'uk', '\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd1\x97\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb0') {'confidence': 1.0, 'encoding': 'ascii'}
('Chinese', 'zh', '\xe4\xb8\xad\xe6\x96\x87') {'confidence': 1.0, 'encoding': 'ascii'}

My code:

import re

# matchReg, getContentFromURL and sitePrefix are not shown in the question.
def getLanguageContent(content):
    mainPattern = re.compile(matchReg)
    mainContentMatch = mainPattern.findall(content)
    return mainContentMatch

arr = getLanguageContent(getContentFromURL(sitePrefix))
print arr
for a in arr:
    a = str(a)
    print a

arr is a list like [('Simple English', 'simple', 'Simple English'), ('Arabic', 'ar', '\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'), ....]

I want to know how I can deal with this problem and print the strings in their correct encoding. Thanks a lot.

JTT
  • `'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t'` is not coded in ASCII, that's clearly UTF-8. For that matter, you _can't_ code `'Tiếng Việt'` in ASCII, at least not without throwing away information (e.g., `'Tieng Viet'`). – abarnert Dec 13 '14 at 05:39
  • Please show us the actual contents of `arr`, or the `getContentFromURL` function, or both, because otherwise it's impossible to do anything but guess at all the things you _could_ be doing wrong here. – abarnert Dec 13 '14 at 05:46
  • I thought it was in ASCII because the result of chardet.detect(a) is something like {'confidence': 1.0, 'encoding': 'ascii'} – JTT Dec 13 '14 at 07:52
  • arr is actually a list of tuples, so I have to call str() first – JTT Dec 13 '14 at 07:53
  • You _definitely_ don't want to call `str` on a tuple, or a list of tuples! In fact, that's your whole problem. Let me edit my answer to explain. – abarnert Dec 13 '14 at 08:28
  • I understand why I am wrong, thank you so much! – JTT Dec 13 '14 at 09:38

2 Answers

First, 'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t' is not encoded in ASCII. It's clearly UTF-8. For that matter, you can't encode 'Tiếng Việt' in ASCII, at least not without throwing away information (e.g., 'Tieng Viet'). And when I run chardet.detect on all of the strings in your example, I get UTF-8, with confidences ranging from 0.7525 to 0.99.

Your problem is that arr is a list of tuples of strings, not a list of strings. When you call str(a) on a tuple, it calls repr on each element, joins the results with commas, and wraps the whole thing in parentheses. In Python 2, the repr of a string is always ASCII, with backslash escapes for non-ASCII and non-printable characters. For example, str(('Vietnamese', 'vi', 'Tiếng Việt')) is "('Vietnamese', 'vi', 'Ti\\xe1\\xba\\xbfng Vi\\xe1\\xbb\\x87t')". That's not a useful string, and because it contains only ASCII characters, chardet reports it as ascii.
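To see the effect in isolation, here is a minimal sketch. It uses explicit b'...' byte literals so that it behaves the same way on Python 2 and Python 3 (on Python 3 the repr additionally carries a b prefix). The names row, as_text and only_ascii are made up for illustration and do not come from the question.

```python
# One tuple shaped like the entries in arr; the third field holds UTF-8 bytes.
row = ('Vietnamese', 'vi', b'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t')

# str() on a tuple builds its text from the repr() of each element,
# so the non-ASCII bytes show up as backslash escapes.
as_text = str(row)

# Every character of the result is plain ASCII, which is consistent with
# a detector reporting 'ascii' for it.
only_ascii = all(ord(ch) < 128 for ch in as_text)

# The original UTF-8 bytes are still intact inside the tuple.
name = row[2]

print(as_text)
print(only_ascii)
print(len(name))
```

The escaped text is only a display form of the tuple; the element row[2] is the value worth working with.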

Instead of trying to figure out how to make a useless string useful, just use the useful strings you already have. Don't call str on a list of tuples of strings, or on each tuple of strings. Just use the strings inside each tuple. For example:

for language, code, name in arr:
    print name

That will (assuming your console can handle UTF-8) print out Tiếng Việt. Or, if you want to decode it to unicode, just use uname = name.decode('utf-8'). Or, if you call chardet.detect(name), it'll confirm that it's UTF-8 with 0.7525 confidence. And so on.
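As a rough sketch of the decoding step, assuming arr holds UTF-8 byte strings as in the question: the sample data below is copied from the question's output, while the names decoded and lengths are hypothetical. Byte literals keep the snippet runnable on both Python 2 and Python 3.

```python
# Sample data in the shape of arr from the question (UTF-8 byte strings).
arr = [
    ('Turkish', 'tr', b'T\xc3\xbcrk\xc3\xa7e'),
    ('Chinese', 'zh', b'\xe4\xb8\xad\xe6\x96\x87'),
]

decoded = []
for language, code, name in arr:
    # Turn the UTF-8 bytes into a text (unicode) string.
    uname = name.decode('utf-8')
    decoded.append((language, code, uname))

# After decoding, len() counts characters rather than bytes.
lengths = [len(uname) for _, _, uname in decoded]
print(lengths)
```

Working with the decoded values means later string operations (slicing, length, comparison) act on characters instead of raw bytes.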

abarnert

This sounds strange. ASCII does not contain Chinese or Japanese characters. They are probably encoded using UTF-8. What you want is str(a).decode("utf-8"), which decodes the UTF-8 encoded string. If you try str(a).decode("ascii"), it should give you an error. But if you just want to print them out, your terminal needs to support UTF-8, so try simply printing str(a).

Also, you haven't posted your entire program, so I am assuming that str(a) is a sentence string.
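Under that assumption (a single UTF-8 encoded byte string rather than a tuple), the difference between the two decode calls can be sketched as follows. The names raw, text and ascii_failed are illustrative only, and the byte literal keeps the snippet runnable on Python 2 and Python 3.

```python
# A single UTF-8 encoded byte string, taken from the question's output.
raw = b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding as UTF-8 succeeds.
text = raw.decode('utf-8')

# Decoding as ASCII fails, because every byte here is above 0x7F.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(text))
print(ascii_failed)
```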

kolonel