I am writing a web crawler using Mechanize and BeautifulSoup4 in Python. To store the data it collects for later analysis, I am using the shelve module. The block of code in which the issue arises is below.
url_dict = shelve.open("url_dict.dat")
html = r.read()
soup = BeautifulSoup(html)
frames = soup.find_all("a", {"class": br_class})  # br_class is defined globally
time.sleep(1)

for item in frames:
    url_suffix = item['href']
    full_url = url_prefix + url_suffix
    full_url = full_url.encode('ascii', 'ignore')
    if str(full_url) not in url_dict:
        url_dict[str(full_url)] = get_information(full_url, sr)
    time.sleep(1)
However, the code does make it through one iteration of the loop before hitting the error. The function get_information() starts off as follows:
def get_information(full_url, sr):
    information_set = dict()
    r = sr.open(full_url)
    information_set['url'] = full_url
    print("Set url")
    html = r.read()
    soup = BeautifulSoup(html)
    information_set["address"] = soup.find("h1", {"class": "prop-addr"}).text
sr is a browser object that opens and reads the URL, and url_suffix is a unicode string. The object returned by get_information() is a dictionary, so url_dict is a dictionary of dictionaries.
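To illustrate, a single entry in the shelf is meant to look something like this (the URL and address here are made up, and the real dictionaries will contain more fields than shown):

    example_entry = {
        'url': 'http://www.example.com/listings/12345',  # made-up URL
        'address': '123 Example St, Springfield',        # text pulled from the h1 tag
        # ... further string fields collected by get_information() ...
    }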
On the second iteration of the loop, I encounter the following error:
Traceback (most recent call last):
  File "collect_re_data.py", line 219, in <module>
    main()
  File "collect_re_data.py", line 21, in main
    data = get_html_data()
  File "collect_re_data.py", line 50, in get_html_data
    url_dict[str(full_url)] = get_information(full_url, sr)
  File "C:\Python27\lib\shelve.py", line 132, in __setitem__
    p.dump(value)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded while calling a Python object
Also, is there a better way of handling data storage for something like this? My end goal is to transfer all the data into a .csv file so I can analyze it in R.
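In case it is relevant, this is roughly the export step I have in mind once the crawl works. The file name and column list are just placeholders, and I am assuming every inner dictionary shares the same keys:

    import csv
    import shelve

    url_dict = shelve.open("url_dict.dat")
    fieldnames = ['url', 'address']  # placeholder column list

    # Write each stored dictionary as one CSV row (binary mode for Python 2's csv module)
    with open("listings.csv", "wb") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for key in url_dict.keys():
            writer.writerow(url_dict[key])

    url_dict.close()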