Error while using urllib.request.urlopen in Python

Question

What's wrong with this code?

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("utf-8"))


<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,

ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});

delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=

[a,b];return false}};

window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}

var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&

window._gjuc())&&setTimeout(_gjp,500)};

Traceback (most recent call last):
  File "<pyshell#109>", line 2, in <module>
    print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte

score 6 · Accepted Answer · edited Jan 03 '12 at 10:33

Google is sending you text in the windows-1251 encoding, as the charset in the meta tag says. Decoding with that codec works:

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("cp1251"))
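If you would rather not hard-code the codec, here is a rough sketch that reads the charset from the Content-Type response header instead. It assumes the server actually declares a charset there, which is not guaranteed, and the header may disagree with the meta tag. The names `decode_body` and `fetch_text` are made up for illustration.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode raw bytes with the declared charset, or with a fallback codec."""
    try:
        return raw.decode(charset or fallback)
    except LookupError:
        # The declared charset is not a codec name Python recognises.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Fetch a URL and decode the body using the header's charset (hypothetical helper)."""
    with urlopen(url) as response:
        # get_content_charset() returns None when the header names no charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Something like `print(fetch_text("http://google.com/"))` would then pick cp1251 or latin-1 on its own, depending on what the server reports. A UnicodeDecodeError can still be raised if the declared charset does not match the bytes actually sent.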

edited Jan 03 '12 at 10:33

Sergey

answered Jan 03 '12 at 09:09

demalexx

joaquin · Answer 2 · 2012-01-03T09:29:56.990

2

This is the line that fails (only its last part is shown):

>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte

The failing byte comes from a Spanish word with an accented letter:

>>> byte = 0xe1
>>> byte
225
>>> chr(225)
'á'

Decoding the line as Latin-1 instead works:

>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'

By the way, "Imágenes" is Spanish for "images".
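To see how the same byte behaves under different codecs, here is a small sketch that tries one byte string against several of them. `try_codecs` is just an illustrative name, and the codec list is an arbitrary choice.

```python
def try_codecs(raw, codecs=("utf-8", "latin-1", "cp1251")):
    """Return {codec: decoded text, or None if the bytes are invalid for that codec}."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


sample = b"Im\xe1genes"
results = try_codecs(sample)

# errors="replace" never raises; undecodable bytes become U+FFFD instead.
lossy = sample.decode("utf-8", errors="replace")
```

Here utf-8 fails, while latin-1 and cp1251 both succeed but give different letters for 0xe1 (a Latin "á" versus a Cyrillic "б"). So a decode that raises no error does not prove the codec is the right one. Latin-1 in particular accepts every byte value.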

edited Jan 03 '12 at 09:29

answered Jan 03 '12 at 09:13

joaquin

It seems Google returns a localized page depending on your IP. For me it's Russian with cp1251 encoding; for you it's Spanish with latin-1. – demalexx Jan 03 '12 at 12:24
@race1 Oh, I see! Interesting... I was fooled because my error was at position 2419, after the same line the OP posted, but the OP's error is at 2364... So our answers only match by coincidence, don't they? – joaquin Jan 03 '12 at 14:17
