Python3: byte array decoding from urlopen

Asked Jun 10 '13 at 19:41

I'm trying to use Python to find some words across web pages (just for practice), but I keep running into a problem. This is my code:

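from urllib import request
from urllib.request import urlopen

url = 'someWikipage'
# Send a browser-like User-Agent header with the request
hdrs = {'User-Agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
req = request.Request(url, None, hdrs)
response = urlopen(req)
htmlBytes = response.read()  # raw bytes of the response body
htmlBytes.decode('utf-8')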

It breaks on the last line, giving me this (common) error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 18573: character maps to <undefined>

Any ideas about how to prevent or ignore this?

edited Jun 10 '13 at 19:43

jamylak


asked Jun 10 '13 at 19:41

Tim


Are you certain that you're reading a `bytes`? – Ignacio Vazquez-Abrams Jun 10 '13 at 19:42
Why do you need to `decode`? Seems that htmlBytes is already unicode. – Paulo Bu Jun 10 '13 at 19:55
@PauloBu unicode doesn't have a decode method – jamylak Jun 10 '13 at 20:24
A small part of the output of `htmlBytes` is `b' \n\n\n'` which lets me assume it is indeed a byte array. Also the documentation confirms this. – Tim Jun 10 '13 at 20:48
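
A note on the traceback, offered as a likely explanation rather than a confirmed diagnosis: a failed UTF-8 decode raises `UnicodeDecodeError`, so a `UnicodeEncodeError` that mentions the 'charmap' codec usually comes from a later step that encodes the decoded text again, for example printing it to a console that uses a legacy code page. The sketch below assumes that code page is cp1252 and uses a short hard-coded byte string in place of `response.read()`. Both are assumptions made for illustration, and the helper name `to_console_safe` is hypothetical.

```python
# Stand-in for response.read(): UTF-8 bytes containing U+2010 (HYPHEN),
# the character named in the traceback. This sample is an assumption.
html_bytes = 'co\u2010op'.encode('utf-8')

# Decoding succeeds; the bytes are valid UTF-8.
text = html_bytes.decode('utf-8')

# Encoding the text to cp1252 (assumed console code page) fails,
# because cp1252 has no mapping for U+2010.
error = None
try:
    text.encode('cp1252')
except UnicodeEncodeError as exc:
    error = exc


def to_console_safe(s, encoding='cp1252'):
    """Replace characters the target encoding cannot represent with '?'."""
    return s.encode(encoding, errors='replace').decode(encoding)


safe = to_console_safe(text)
print(safe)
```

If this is indeed the cause, the decode call itself is fine, and the choice is between replacing unencodable characters as above (`errors='backslashreplace'` is another standard handler) and switching the output to an encoding that can represent them.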

0 Answers