
I am writing a web crawler in Python using Mechanize and BeautifulSoup4. To store the data it collects for further analysis, I am using the shelve module. The block of code in which the issue arises is below:

url_dict = shelve.open("url_dict.dat")
html = r.read()
soup = BeautifulSoup(html)
frames = soup.find_all("a", {"class": br_class})  # br_class is defined globally
time.sleep(1)
for item in frames:
    url_suffix = item['href']
    full_url = url_prefix + url_suffix
    full_url = full_url.encode('ascii', 'ignore')
    if str(full_url) not in url_dict:
        url_dict[str(full_url)] = get_information(full_url, sr)
    time.sleep(1)

However, the code does get through one full iteration of the loop before hitting the error. The function get_information() starts off as follows:

def get_information(full_url, sr):
    information_set = dict()
    r = sr.open(full_url)
    information_set['url'] = full_url
    print("Set url")
    html = r.read()
    soup = BeautifulSoup(html)
    information_set["address"] = soup.find("h1", {"class": "prop-addr"}).text

Here sr is a browser object that opens and reads the URL, and url_suffix is a unicode string. get_information() returns a dictionary, so url_dict is a dictionary of dictionaries.

On the second iteration of the loop, I encounter the following error:

Traceback (most recent call last):
  File "collect_re_data.py", line 219, in <module>
    main()
  File "collect_re_data.py", line 21, in main
    data=get_html_data()
  File "collect_re_data.py", line 50, in get_html_data
    url_dict[str(full_url)]=get_information(full_url,sr)
  File "C:\Python27\lib\shelve.py", line 132, in __setitem__
    p.dump(value)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded while calling a Python object

Also, is there a better way of handling data storage for something like this? My end goal is to transfer all the data into a .csv file so I can analyze it in R.

Max Candocia
  • Looks like the `BeautifulSoup` `NavigableString` objects are not so shelvable. Try using `unicode(soup.find("h1",{"class":"prop-addr"}).text)` instead. – Martijn Pieters Oct 17 '13 at 17:50
  • Also, `get_information()` doesn't appear to *return* anything. Are you certain you are not missing a `return information_set` or something in that function? – Martijn Pieters Oct 17 '13 at 17:51
  • Is it possible that whatever `get_information` returns is contained (directly or indirectly) in itself (or contains `url_dict`)? Meanwhile, since `shelve` just uses `pickle.dumps(value)`, this might be a whole lot easier to debug if you stripped it down to just "build the value, then pickle it". – abarnert Oct 17 '13 at 17:56
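
Putting the two comment suggestions together, the start of get_information() would look something like this (a sketch covering only the lines shown in the question; the rest of the function is elided):

def get_information(full_url, sr):
    information_set = dict()
    r = sr.open(full_url)
    information_set['url'] = full_url
    html = r.read()
    soup = BeautifulSoup(html)
    # unicode() reduces the value to a plain unicode string instead of a
    # BeautifulSoup object that references the whole parse tree
    information_set["address"] = unicode(
        soup.find("h1", {"class": "prop-addr"}).text)
    # ... remaining fields ...
    return information_set  # without an explicit return, shelve stores None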

1 Answer


This is a known issue with pickle and BeautifulSoup. shelve pickles each value before storing it, so I suspect the issue you are seeing with shelve is the same one.
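
A stripped-down reproduction along the lines abarnert suggested: build the value, then pickle it directly. The sample HTML is illustrative; note that .string returns a NavigableString, which subclasses unicode but keeps a reference back into the parse tree (whether the first dumps call actually fails depends on the BeautifulSoup version, but it did with the versions current at the time):

import pickle
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1 class='prop-addr'>123 Main St</h1>")
address = soup.find("h1", {"class": "prop-addr"}).string  # a NavigableString

pickle.dumps({"address": address})           # recursion error: tries to pickle the tree
pickle.dumps({"address": unicode(address)})  # fine: a plain unicode string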

speedplane
  • But I convert any data passed to a string, so shouldn't the issue be avoided? Or is it a problem because I've loaded BeautifulSoup at all? I'll try messing with the format of the string values later, but it's really frustrating. Also, are there any other recommendations you have for data storage modules/libraries? – Max Candocia Oct 17 '13 at 21:49
  • 2
    Nevermind, I forgot to convert a few values to text. Now it's working. – Max Candocia Oct 17 '13 at 22:24
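
For the CSV export the question mentions as the end goal, the standard library's csv module is probably enough once every stored value is a flat dict of byte strings; a minimal sketch (the url_data.csv filename is illustrative):

import csv
import shelve

url_dict = shelve.open("url_dict.dat")
rows = url_dict.values()  # each value is one page's dict of fields
fieldnames = sorted(set().union(*(row.keys() for row in rows)))

with open("url_data.csv", "wb") as f:  # binary mode for the Python 2 csv module
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

url_dict.close()

In R, the result can then be read with read.csv("url_data.csv").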