I feel like I am missing something very basic here about the limits of Python processes. I have a screen scraper that is supposed to go to a password-protected site once a week, fill out a form to update existing records, and then grab new records. (I'm using Django to actually insert the records, if that matters.)
The data I'm scraping builds out over the course of the year. So in January, the process is relatively quick. By August, there are thousands of rows to update in addition to whatever new records have been added.
It's worked like a dream this year, but it recently started running into a connection error with this traceback:
Traceback (most recent call last):
  File "douglasdivorces.py", line 42, in <module>
    forms = [f for f in br.forms()]
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4.py2.6.egg/mechanize/_mechanize.py", line 420, in forms
    return self._factory.forms()
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_html.py", line 557, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_html.py", line 237, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_form.py", line 844, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_form.py", line 979, in _ParseFileEx
    data = file.read(CHUNK)
  File "/usr/local/lib/python2.6/dist-packages/mechanize-0.2.4-py2.6.egg/mechanize/_response.py", line 195, in read
    data = self.wrapped.read(to_read)
  File "/usr/lib/python2.6/socket.py", line 353, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.6/httplib.py", line 518, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.6/httplib.py", line 551, in _read_chunked
    line = self.fp.readline()
  File "/usr/lib/python2.6/socket.py", line 397, in readline
    data = recv(1)
  File "/usr/lib/python2.6/ssl.py", line 96, in <lambda>
    self.recv = lambda buflen=1024, flags=0: SSLSocket.recv(self, buflen, flags)
  File "/usr/lib/python2.6/ssl.py", line 217, in recv
    return self.read(buflen)
  File "/usr/lib/python2.6/ssl.py", line 136, in read
    return self._sslobj.read(len)
socket.error: [Errno 104] Connection reset by peer
Is there a way to account for this error, holding my loops in place until the problem is resolved? Or is there another approach I should take?
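To make "holding my loops in place" concrete, here is a rough sketch of the kind of retry wrapper I'm imagining. It is not taken from my script: `retry_on_socket_error` and all of its parameters are names I made up for illustration, and I'm only assuming that catching `socket.error`, waiting, and trying again is a sensible way to ride out a reset.

```python
import socket
import time


def retry_on_socket_error(func, max_attempts=5, initial_delay=1.0,
                          backoff=2.0, sleep=time.sleep):
    """Call func() and return its result, retrying on socket.error.

    Hypothetical helper: waits initial_delay seconds after the first
    failure and multiplies the wait by backoff after each later one.
    The last failure is re-raised so the caller still sees it.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except socket.error:
            if attempt == max_attempts:
                raise  # out of attempts, let the error propagate
            sleep(delay)
            delay *= backoff
```

I'd presumably have to wrap the whole fetch (re-opening the page and then reading `br.forms()`) in the function I pass in, rather than just the `br.forms()` call, since the response that was being read when the reset happened is probably unusable. I don't know whether that guess is right.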
Again, my hope is that I'm missing something preschool-level, so I'll save you all the pain of reading my code. If it's not that simple, say the word and I'll edit the question to include the script.
Thanks so much! Very curious to hear what's causing me fits!