
I'm trying to go through a series of numbered data pages using urllib2. What I want to do is use a try statement, but I have little knowledge of it. Judging by reading up a bit, it seems to be based on specific 'names' that are exceptions, e.g. IOError etc. I don't know what the error code I'm looking for is, which is part of the problem.

I've written (or pasted from 'urllib2 - The Missing Manual') my urllib2 page-fetching routine thus:

import os
import sys
import cookielib
import urllib2

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()

    txheaders =  {'User-agent' : useragent}

    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
        print "previous cookie loaded..."
    else:
        print "no ospath to cookfile"

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    try:
        req = Request(url, headers=txheaders)
        # create a request object carrying our User-agent header

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'Failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        return False

    else:
        print
        if cj is None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, '  :  ', cookie
            cj.save(COOKIEFILE)           # save the cookies once, after listing them

        page = handle.read()
        return page

def fetch_series():

    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except [something]:
        print "failed to get page"
        sys.exit()

The bottom function is just an example to show what I mean; can anyone tell me what I should be putting there? I made the page-fetching function return False if it gets a 404; is this correct? So why doesn't except False: work? Thanks for any help you can give.
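
For comparison, testing the return value instead would look roughly like this (a sketch, using the same names as above):

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    page = fetch_page(url, useragent)
    if page is False:    # a return value is tested with if, not caught with except
        print "failed to get page"
        sys.exit()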

OK, well, as per advice here I've tried:

except urllib2.URLError, e:

except URLError, e:

except URLError:

except urllib2.IOError, e:

except IOError, e:

except IOError:

except urllib2.HTTPError, e:

except urllib2.HTTPError:

except HTTPError:

None of them work.
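
One likely reason none of these fire: in Python 2, urllib2.HTTPError is a subclass of URLError, which is itself a subclass of IOError, so the except IOError inside fetch_page above already swallows the 404 and returns False; nothing ever propagates out to the try in fetch_series. A minimal sketch of re-raising inside fetch_page so the caller's except clause can see the error (same names as the function above):

    try:
        handle = urlopen(req)
    except urllib2.HTTPError, e:
        print 'Failed to open "%s" with error code %s.' % (url, e.code)
        raise    # re-raise so fetch_series() can catch urllib2.HTTPError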

  • For Python 3, see: [Get HTTP Error code from requests.exceptions.HTTPError](http://stackoverflow.com/q/19342111/55075) – kenorb Jul 28 '15 at 08:40

3 Answers


You should catch urllib2.HTTPError if you want to detect a 404:

try:
    req = urllib2.Request(url, headers={'User-agent': useragent})
    # create a request object

    handle = urllib2.urlopen(req)
    # and open it to return a handle on the url
except urllib2.HTTPError, e:
    print 'We failed with error code - %s.' % e.code

    if e.code == 404:
        pass    # do stuff..
    else:
        pass    # other stuff...

    return False
else:
    pass    # ...

To catch it in fetch_series(), fetch_page() must let the exception propagate (for example by re-raising it instead of only printing):

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    try:
        handle = urlopen(Request(url, headers={'User-agent': useragent}))
        #...
    except IOError, e:
        raise    # re-raise so fetch_series() can catch the HTTPError
    else:
        pass     #...

def fetch_series():
    useragent = "Firefox...etc."
    url = "www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except urllib2.HTTPError, e:
        print "failed to get page"

From http://docs.python.org/library/urllib2.html:

exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.

code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.
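
Since an HTTPError doubles as a file-like response, the status code and the error page body can both be read straight off the exception; a minimal sketch (httpbin.org used here purely as a test URL):

try:
    handle = urllib2.urlopen('http://httpbin.org/status/404')
except urllib2.HTTPError, e:
    print e.code    # the numeric status code, e.g. 404
    print e.read()  # the body of the error page, since e is file-like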

chown
  • I'm testing for that outside of the urllib2 function though, does that matter? I sort of want it to be a generic function for many things and then look for types of errors outside it. –  Nov 24 '11 at 20:57
  • OK, I'll give that a go, thank you. Going to have to be tomorrow now. Digging the name too ;) –  Nov 24 '11 at 21:01
  • Oh I see, I assumed I would have to test the return value of the fetch. Ah, you dude. May your file permissions always be in perfect order and may your boxes never be 0wned by skiddz (at least) :D –  Nov 24 '11 at 21:17
  • Hmmm, didn't seem to work, just went straight on by it... I'll have another look tomorrow when I'm less tired. Many thanks again. –  Nov 24 '11 at 21:30
  • Nope, tried again with a bunch of different stuff I read from that link you sent, but it all just seems to ignore it... –  Nov 25 '11 at 08:27
  • Yep, 'import urllib2'. I fixed it by doing an if statement to check for False; I have no idea whatsoever about exceptions, it all seems like a world of pain to me at the moment. One thing I realised is that I got the call wrong: I was doing page = fetch_page(blah, blah) rather than just straight calling it - would that make a difference? –  Nov 25 '11 at 08:52

I recommend you check out the wonderful requests module.

With it you could achieve the functionality you are asking about like so:

import requests
from requests.exceptions import HTTPError

try:
    r = requests.get('http://httpbin.org/status/200')
    r.raise_for_status()    # raises HTTPError if the response status is 4xx/5xx
except HTTPError:
    print 'Could not download page'
else:
    print r.url, 'downloaded successfully'

try:
    r = requests.get('http://httpbin.org/status/404')
    r.raise_for_status()
except HTTPError:
    print 'Could not download', r.url
else:
    print r.url, 'downloaded successfully'
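
Applied to a series of numbered pages as in the question (the %02d pattern is just a guess at the numbering scheme), the same idea might look like:

import requests
from requests.exceptions import HTTPError

for n in range(1, 100):
    url = 'http://www.example.com/%02d.html' % n    # hypothetical numbering scheme
    try:
        r = requests.get(url)
        r.raise_for_status()    # raises HTTPError on 4xx/5xx responses
    except HTTPError:
        print 'Could not download', url
        break                   # stop at the first missing page
    print url, 'downloaded successfully'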
Acorn
  • So you are suggesting I write the whole thing again and use something else, or is this some add-on to urllib2? Bear in mind I'm a total newb at all this; it took me ages to figure out how to get a page to download! If it ain't broke don't fix it ;) Does this requests thing deal with cookies and redirects as well? –  Nov 24 '11 at 21:19
  • I'm so tired I didn't begin by thanking you, so sorry for that. Many thanks for taking the time to help a brother out. –  Nov 24 '11 at 21:36
  • Hey, you're right, that module is pretty cool, and although urllib2 isn't broken (it works for me right now) I see what you mean about the simplicity. Thanks. –  Nov 25 '11 at 10:49
  • I had no idea what amazing advice this was beforehand; the difference is rather shocking. –  Nov 26 '11 at 22:15

Interactive poking:

To find out about the nature and possible content of such exceptions in Python, it is best simply to try the key calls interactively:

>>> f = urllib2.urlopen('http://httpbin.org/status/404')
Traceback (most recent call last):
...
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: NOT FOUND

Then sys.last_value contains the exception value that fell through to the interactive prompt and can be played with (use the TAB + . auto-expansion of the interactive shell, dir(), vars(), ...):

>>> ev = sys.last_value
>>> ev.__class__
<class 'urllib2.HTTPError'>
>>> dir(ev)
['_HTTPError__super_init', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'args', 'close', 'code', 'errno', 'filename', 'fileno', 'fp', 'getcode', 'geturl', 'hdrs', 'headers', 'info', 'message', 'msg', 'next', 'read', 'readline', 'readlines', 'reason', 'strerror', 'url']
>>> vars(ev)
{'fp': <addinfourl at 140193880 whose fp = <socket._fileobject object at 0x01062370>>, 'fileno': <bound method _fileobject.fileno of <socket._fileobject object at 0x01062370>>, 'code': 404, 'hdrs': <httplib.HTTPMessage instance at 0x085ADF80>, 'read': <bound method _fileobject.read of <socket._fileobject object at 0x01062370>>, 'readlines': <bound method _fileobject.readlines of <socket._fileobject object at 0x01062370>>, 'next': <bound method _fileobject.next of <socket._fileobject object at 0x01062370>>, 'headers': <httplib.HTTPMessage instance at 0x085ADF80>, '__iter__': <bound method _fileobject.__iter__ of <socket._fileobject object at 0x01062370>>, 'url': 'http://httpbin.org/status/404', 'msg': 'NOT FOUND', 'readline': <bound method _fileobject.readline of <socket._fileobject object at 0x01062370>>}
>>> sys.last_value.code
404

Handling it with try/except:

>>> try: f = urllib2.urlopen('http://httpbin.org/status/404')
... except urllib2.HTTPError, ev:
...     print ev, "'s error code is", ev.code
...     
HTTP Error 404: NOT FOUND 's error code is 404

Building a simple opener which doesn't throw HTTP errors:

>>> ho = urllib2.OpenerDirector()
>>> ho.add_handler(urllib2.HTTPHandler())
>>> f = ho.open('http://localhost:8080/cgi/somescript.py'); f
<addinfourl at 138851272 whose fp = <socket._fileobject object at 0x01062370>>
>>> f.code
500
>>> f.read()
'Execution error: <pre style="background-color:#faa">\nNameError: name \'e\' is not defined\n<pre>\n'

The default handlers of urllib2.build_opener:

default_classes = [ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor]
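
It is the HTTPErrorProcessor in that default list (together with HTTPDefaultErrorHandler) that turns non-2xx responses into raised HTTPErrors; the hand-built OpenerDirector above omits them, which is why f.code could simply be read. For contrast, the default opener raises on the same kind of request (output abridged):

>>> urllib2.build_opener().open('http://httpbin.org/status/404')
Traceback (most recent call last):
...
HTTPError: HTTP Error 404: NOT FOUND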

kxr