
I'm trying to use urllib2 to download a web page and save it to a MySQL database, like this:

result_text = result.read()
result_text = result_text.decode('utf-8')

However, I get this error:

Data: 'utf8' codec can't decode byte 0x88

Now, the HTML meta tag states that the encoding is indeed UTF-8. I've managed to get around this with this line:

result_text = result_text.decode('utf-8','replace')

This replaces the bad characters. However, I'm not sure that this isn't an indication that something is wrong with the downloaded data, or that I'm discarding valuable characters. I should add that the page also contains JavaScript, although I don't believe that should be a problem.

Can anyone tell me why this is happening? Thanks

Meitham
WeaselFox
    Are you sure that all characters on the page are in utf-8? Although the header says that, it could be "a lie" – DonCallisto Jan 29 '12 at 14:02
    When you get the `'utf8' codec can't decode byte 0x88` error, it should also tell you the location of the offending byte. If the location is `n`, then add a print statement: `print(repr(result_text[n-20:n+20]))` before the call to `decode('utf-8')`, and post the result here. – unutbu Jan 29 '12 at 14:02
  • 1
  • Because you did not post a link to the source data we cannot give you a proper answer. However, the source data most likely has a bad UTF-8 encoding and there is nothing you can do about it. – Mikko Ohtamaa Jan 29 '12 at 14:36
  • unutbu, thanks for the response! here is the part of the string in question : `url:"\x98cW\x01\xa2\xbb\xba\xcc\xec\x90\xfc\xffP\xcb%\x01\x08",s` – WeaselFox Jan 29 '12 at 14:43
  • Mikko Ohtamaa - if so, then replacing the characters would be the right approach I guess... – WeaselFox Jan 29 '12 at 14:44
  • @WeaselFox: You guess wrongly. See my answer. – John Machin Jan 29 '12 at 20:51
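unutbu's suggestion above can be sketched as follows. This is only an illustration: the sample bytes are made up, standing in for whatever `result.read()` actually returned; the point is that catching the `UnicodeDecodeError` (rather than replacing the bad bytes blindly) gives you the failing offset via `e.start`, so you can inspect the surrounding raw bytes:

```python
# Sketch of unutbu's suggestion: locate the offending byte instead of
# replacing it. The sample bytes here are made up for illustration;
# in the real code, `raw` would be result.read().
raw = b'url:"abc\x88def"'

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as e:
    # e.start is the offset of the first byte that failed to decode
    print('cannot decode byte %r at offset %d' % (raw[e.start:e.start + 1], e.start))
    # show up to 20 bytes of context on either side of the failure
    print('context: %r' % raw[max(e.start - 20, 0):e.start + 20])
```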

1 Answer


Analysis of your tiny data sample:

>>> s = "\x98cW\x01\xa2\xbb\xba\xcc\xec\x90\xfc\xffP\xcb%\x01\x08"
>>> u = s.decode('utf8', 'replace')
>>> u
u'\ufffdcW\x01\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdP\ufffd%\x01\x08'
>>> u.count(u'\ufffd')
9
>>> len(u)
16

(1) That's certainly not UTF-8 with an occasional invalid sequence; over 50% of the unicode characters are invalid. In other words, pressing ahead and using data.decode('utf8', 'replace') is NOT a good idea (based on this TINY sample).

(2) The characters \x01 (twice) and \x08 make me suspect that you have got binary data somehow.

(3) The (truncated) error message that you quoted in a comment mentioned 0x88 but there is no 0x88 in the sample data.

(4) Please edit your question to show what you should have done at the start: (a) the minimal code necessary to reproduce the problem, including the URL that you are accessing (b) the full error message and traceback (c) an assurance that you have copied/pasted (a) and (b) rather than typing from memory
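One guess about point (2), offered as speculation only since the URL isn't available: a common way to end up with binary-looking bytes from urllib2 is a server that sends the body gzip-compressed, because urllib2 hands back the raw body without decompressing it. Checking is cheap, since gzip streams start with the magic bytes `1f 8b`. A sketch (the `maybe_gunzip` helper name is made up; `gzip.compress` in the demo is Python 3, on Python 2 you would build the stream with `GzipFile`):

```python
import gzip
import zlib

def maybe_gunzip(data):
    # gzip streams begin with the magic bytes 1f 8b; urllib2 returns the
    # raw body, so a gzipped response looks like binary garbage.
    if data[:2] == b'\x1f\x8b':
        # 16 + MAX_WBITS tells zlib to expect a gzip wrapper
        return zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data

# round-trip demo with a made-up payload
payload = u'caf\xe9 page'.encode('utf-8')
assert maybe_gunzip(gzip.compress(payload)) == payload
assert maybe_gunzip(b'plain bytes') == b'plain bytes'
```

Note that the sample data quoted in the comments does not start with `1f 8b`, so this particular explanation may well not apply; it is just the cheapest hypothesis to rule out.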

John Machin
    ok, let me address your comments : 1. true, but also the only invalid sequences are in this "url:" part that is small and for me insignificant. 2. from the code `result = proxy['opener'].open(request) result_text = result.read() result_text.decode('utf-8')` the url I cannot disclose.. 3. in different runs I got different invalid sequences. 4. rest assured I have copy/pasted. – WeaselFox Jan 30 '12 at 08:04