
I've written a crawler that uses urllib2 to fetch URLs.

Every few requests I get some weird behaviors. I've tried analyzing them with Wireshark but couldn't understand the problem.

getPAGE() is responsible for fetching the URL. It returns the content of the page (response.read()) if it successfully fetches the URL, otherwise it returns None.

import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req) # fetching the url
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress # a generic Exception has no .reason attribute
            time.sleep(4)
            attempts += 1
        else:
            return response.read()
    return None

This is the function that calls getPAGE() and checks whether the page I've fetched is valid (the check is companyID = soup.find('span', id='lblCompanyNumber').string; if companyID is None, the page is not valid). If the page is valid, it saves the soup object to a global variable named curRes.

def isValid(ID):
    global curRes
    try:
        address = urlPath + str(ID)
        page = getPAGE(address)
        if page is None:
            saveToCsv(ID, badRequest=True)
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of parseHTML: " + str(e) + ' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of parseHTML: " + str(e) + ' address: ' + address
            return False
        try:
            companyID = soup.find('span', id='lblCompanyNumber').string
            if companyID is None: # if lblCompanyNumber's text is None we can assume we don't have the content we want, save to the bad log file
                saveToCsv(ID, isEmpty=True)
                return False
            else:
                curRes = soup # we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

The strange behaviors are:

  1. There are times when urllib2 sends a GET request and, without waiting for the reply, sends the next GET request (ignoring the last one).
  2. Sometimes I get "[errno 10054] An existing connection was forcibly closed by the remote host" after the code has been stuck for about 20 minutes waiting for a response from the server. While it's stuck I copy the URL and fetch it manually, and I get a response in less than a second (?).
  3. getPAGE() will return None to isValid() if it fails to fetch the URL, yet sometimes I get the error -

Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id:....

That's weird, because I only create the soup object if I got a valid result from getPAGE(), and yet it seems that soup.find() is returning None, which raises the exception whenever I run

companyID = soup.find('span', id='lblCompanyNumber').string

The soup object should never be None; it should get the HTML from getPAGE() if execution reaches that part of the code.
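
To debug this I could add a defensive check around the find() call (a minimal sketch using the same names as in isValid() above; it wouldn't fix the root cause, but it would show whether the span is missing entirely or just empty):

# inside the third try block of isValid(), replacing the direct .string access
tag = soup.find('span', id='lblCompanyNumber')
if tag is None:
    # the span itself is missing - the server returned something other than the expected page
    saveToCsv(ID, isEmpty=True)
    return False
companyID = tag.string
if companyID is None:
    # the span exists but has no text
    saveToCsv(ID, isEmpty=True)
    return False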

I've checked and saw that this problem is somehow connected to the first one: each time I got that exception, it was for a URL for which (according to Wireshark) urllib2 sent a GET request but didn't wait for the response and moved on. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) wouldn't pass the "if page is None:" condition. I can't figure out why it's happening, and I can't replicate the issue.

I've read that time.sleep() can cause issues with urllib2 threading, so maybe I should avoid using it?

Why doesn't urllib2 always wait for the response? (It happens only rarely that it doesn't wait.)

What can I do about the "[errno 10054] An existing connection was forcibly closed by the remote host" error? By the way, the exception isn't caught by getPAGE()'s try/except block; it is caught by the first try/except block in isValid(), which is also weird because getPAGE() is supposed to catch every exception it raises:

try:
    address = urlPath + str(ID)
    page = getPAGE(address)
    if page is None:
        saveToCsv(ID, badRequest=True)
        return False
except Exception, e:
    print "An error occurred in the first Exception block of parseHTML: " + str(e) + ' address: ' + address
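
One guess: if the 10054 error is actually raised by response.read() rather than by urlopen(), that would explain why it escapes getPAGE()'s try/except, since read() is only called in the else clause. Here is a minimal sketch of getPAGE() with the read guarded as well, plus a timeout passed to urlopen() (supported since Python 2.6; the 30 seconds is an arbitrary value I picked, not something from the code above):

import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req, timeout=30)  # timeout so a dead connection can't hang for 20 minutes
            return response.read()               # read inside the try so socket errors during the read are caught too
        except HTTPError, e:
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
        except URLError, e:
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
        except Exception, e:
            print 'Something bad happened in getPAGE: ' + str(e) + "  address: " + FetchAddress
        time.sleep(4)
        attempts += 1
    return None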

Thanks!

  • You've carefully guarded `urlopen(req)`, but `response.read()` can raise an exception too. – Gareth Rees Jul 25 '11 at 19:43
  • Can't I assume that if the code gets to the else clause, response.read() will return valid content? The response object is supposed to be fine if the URL was fetched? – YSY Jul 25 '11 at 19:55
  • The response object is fine, but you can't assume that reading from it will succeed. The connection might get closed mid-response. – Gareth Rees Jul 25 '11 at 20:00
  • Wow, I thought that urlopen returns the content and that read() simply gives back the content of the URL saved in one of the response's attributes. I couldn't find any example that wraps the read() function; how do I validate that it's complete? – YSY Jul 25 '11 at 20:16
  • In the end I replaced urllib2 with httplib2, which fixed the problem. – YSY Jul 29 '11 at 20:42
  • I have a similar problem and switching to httplib2 did not fix it for me. Now I just get the "existing connection was forcibly closed" error out of httplib2. Is there a safer alternative to time.sleep() that I can use, if that truly is the culprit? – ThatAintWorking May 02 '12 at 21:00
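
For reference, the httplib2 switch mentioned in the comments above would look roughly like this (a sketch only; the timeout value and the status check are my additions, not something stated in the comments):

import httplib2

def getPAGE(FetchAddress):
    h = httplib2.Http(timeout=30)  # httplib2 lets you set a per-connection timeout
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    try:
        resp, content = h.request(FetchAddress, 'GET', headers=headers)
    except Exception, e:
        print 'httplib2 request failed: ' + str(e) + '  address: ' + FetchAddress
        return None
    if resp.status != 200:
        print 'Bad status: ' + str(resp.status) + '  address: ' + FetchAddress
        return None
    return content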
