
I feel like I am missing something very basic here about the limits of python processes. I have a screen scraper that is supposed to go to a password-protected site once a week, filling out a form to update existing records and then grabbing new records. (I'm using Django to actually insert the records, if that matters).

The data I'm scraping builds out over the course of the year. So in January, the process is relatively quick. By August, there are thousands of rows to update in addition to whatever new records have been added.

It's worked like a dream this year, but recently started running into a connection error with this traceback:

Traceback (most recent call last):
  File "douglasdivorces.py", line 42, in <module>
    forms = [f for f in br.forms()]
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4.py2.6.egg/mechanize/_mechanize.py", line 420, in forms
    return self._factory.forms()
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_form.py", line 979, in _ParseFileEx
    data = file.read(CHUNK)
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_response.py", line 195, in read
    data = self.wrapped.read(to_read)
  File "/usr/lib/python2.6/socket.py", line 353, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.6/httplib.py", line 518, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.6/httplib.py", line 551, in _read_chunked
    line = self.fp.readline()
  File "/usr/lib/python2.6/socket.py", line 397, in readline
    data = recv(1)
  File "/usr/lib/python2.6/ssl.py", line 96, in <lambda>
    self.recv = lambda buflen=1024, flags=0: SSLSocket.recv(self, buflen, flags)
  File "/usr/lib/python2.6/ssl.py", line 217, in recv
    return self.read(buflen)
  File "/usr/lib/python2.6/ssl.py", line 136, in read
    return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer

Is there a way to account for this error, holding my loops in place until the problem is resolved? Or is there another approach I should take?

Again, my hope is that I'm missing something preschool level, so I'll save you all the pain of posting my code. If it's not that simple, say the word and I'll edit the question to include the script.

Thanks so much! Very curious to hear what's causing me fits!

user1046162

1 Answer


socket.error: [Errno 104] Connection reset by peer

I.e., the remote server closed the connection on its end, so it apparently doesn't like your request. Maybe they changed something.
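
One way to keep the loop in place when the reset happens is to catch the error and retry after a pause. The following is only a minimal sketch under stated assumptions, not the asker's code: `retry_on_reset` and its parameters are hypothetical names, and it assumes the failing step can safely be repeated from the start (for example, re-opening the page before reading its forms).

```python
import errno
import socket
import time


def retry_on_reset(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); if the peer resets the connection, wait and try again.

    All names here are illustrative. `sleep` is injectable so the
    helper can be exercised without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except socket.error as exc:
            # Retry only ECONNRESET (errno 104 on Linux). Re-raise any
            # other socket error, and give up after the last attempt.
            if exc.errno != errno.ECONNRESET or attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In the scraper, the whole per-record step would go inside `func`, e.g. `retry_on_reset(lambda: update_record(br, row))`, where `update_record` is a hypothetical function that opens the page, fills out the form, and submits it. Retrying only `br.forms()` is unlikely to help, because the response it reads from has already been dropped.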

dan-klasson
  • I should be more specific. If they're changing something, it's happening quickly and then getting changed right back. The script fills out the form a few thousand times. It will get through 1,000 or so without a problem, then break with the error above. If I fire it again, it has no problems again for a while. – user1046162 Aug 08 '13 at 19:13
  • Maybe they have started applying some throttling limit. – dan-klasson Aug 08 '13 at 19:20
  • Thanks! Is there a way to deal with that? I can sleep for as long as necessary, but at the point I hit the error everything breaks. – user1046162 Aug 08 '13 at 20:31
  • Just call `sleep()` every x iterations of your loop and see if that helps. – dan-klasson Aug 08 '13 at 21:17
  • No dice. Even if I have it sleep for ten seconds after every loop, it eventually just fails. My guess has been that the site I'm scraping is finicky and drops at odd times. If that's happening, is there a way to check for it? Super sorry to be asking such simple questions -- I've been banging my head against the wall trying to get it to work, searching for solutions, but nothing has done it so far. – user1046162 Aug 08 '13 at 21:40
  • Wow, very late answer. But anyway: there are some tricks, like setting a random wait between requests and also randomly changing the headers with which one visits the website. – Willem van Houten Mar 19 '23 at 15:09
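
The random-wait and header-changing tricks from the last comment could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `USER_AGENTS` strings are placeholders, and `random_delay`, `random_headers`, and `polite_loop` are hypothetical helpers. Since a fixed ten-second sleep did not help above, this may not be enough on its own either.

```python
import random
import time

# Placeholder User-Agent strings (made up); substitute real ones.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 6.1) ExampleBrowser/2.0",
]


def random_delay(low=2.0, high=10.0, rng=random):
    """Pick a wait time in seconds, uniformly between low and high."""
    return rng.uniform(low, high)


def random_headers(rng=random):
    """Build a header list with a randomly chosen User-Agent."""
    return [("User-agent", rng.choice(USER_AGENTS))]


def polite_loop(items, handle, sleep=time.sleep, rng=random):
    """Process each item with fresh headers, pausing a random time after each."""
    results = []
    for item in items:
        results.append(handle(item, random_headers(rng)))
        sleep(random_delay(rng=rng))
    return results
```

Here `handle(item, headers)` stands in for whatever submits one form. With mechanize, the header list would typically be assigned to `br.addheaders` before each request.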