I am trying to get the program below working. It is supposed to find email addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Here is the code:

# -*- coding: utf-8 -*-
import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website

# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')

# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []
    # Limit the recursion, we're not downloading the whole Internet
    if(maxlevel == 0):
        return

    # Get the webpage
    req = requests.get(url)
    # Check if successful
    if(req.status_code != 200):
        return []

    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)

    # Find all emails on current page
    result += email_re.findall(req.text)
    return result

emails = crawl('http://ccs.neu.edu', 2)

print "Scraped e-mail addresses:"
for e in emails:
    print e

The error I get is below:

C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
    emails = crawl('http://ccs.neu.edu', 2)
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
    result += crawl(link, maxlevel - 1)
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
    result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable

Process finished with exit code 1

Any suggestions will help. Thanks!

Un1x

1 Answer

The problem is this:

if(maxlevel == 0):
    return

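Currently the function returns None when maxlevel == 0, because a bare return returns None. The caller then runs result += crawl(link, maxlevel - 1), and you can't extend a list with a None object, which is the TypeError in your traceback. Return an empty list [] there instead, so that every path through crawl returns a list.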
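
To see the fix in isolation, here is a minimal sketch that replaces the network calls with a hypothetical in-memory site (the PAGES dictionary and its page names are made up for illustration, not taken from the question). The recursion has the same shape as the original crawl, but every exit path returns a list:

```python
# Hypothetical in-memory "site": page name -> (links on the page, emails on the page).
# This stands in for requests.get() and the regexes so the sketch runs offline.
PAGES = {
    "home": (["about", "contact"], ["info@example.com"]),
    "about": (["home"], []),
    "contact": ([], ["sales@example.com"]),
}


def crawl(page, maxlevel):
    # Return a list on every path so that `result += crawl(...)` always works.
    if maxlevel == 0:
        return []
    if page not in PAGES:  # plays the role of the status_code != 200 check
        return []

    links, emails = PAGES[page]
    result = []
    # Follow the links first, merging each sub-result into this call's list.
    for link in links:
        result += crawl(link, maxlevel - 1)
    # Then add the emails found on the current page.
    result += emails
    return result


emails = crawl("home", 2)
print(emails)
```

With a bare return in the maxlevel == 0 branch, the call crawl("home", 0) made while processing "about" would hand None back to result +=, reproducing the error from the question.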

Tuan Anh Hoang-Vu
  • Thanks, that worked. One follow-up question, though: why isn't result reset to an empty list the next time the function is called? Also, how can I initialize result outside the function and use it inside? Is there something like a public variable in Python? I tried global, but it didn't work. – Un1x Jun 05 '15 at 22:52
  • Yes, Python does have global variables: http://stackoverflow.com/questions/423379/using-global-variables-in-a-function-other-than-the-one-that-created-them – Tuan Anh Hoang-Vu Jun 05 '15 at 22:56
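
On the follow-up in the comments: result is reset on every call, but each call has its own local list, and the caller merges the returned list into its own with +=, so nothing is lost. If a single shared list is preferred, the sketch below shows two ways to do it, again using a hypothetical in-memory PAGES dictionary in place of real HTTP requests. The names found, crawl_shared, and crawl_into are made up for this illustration.

```python
# Hypothetical in-memory "site": page name -> (links on the page, emails on the page).
PAGES = {
    "home": (["about", "contact"], ["info@example.com"]),
    "about": (["home"], []),
    "contact": ([], ["sales@example.com"]),
}

# Option 1: a module-level list shared by every call.
found = []


def crawl_shared(page, maxlevel):
    # Mutating the list with .extend() needs no `global` statement.
    # `global found` is only required when the function rebinds the name,
    # e.g. `found = [...]` or `found += [...]`.
    if maxlevel == 0 or page not in PAGES:
        return
    links, emails = PAGES[page]
    for link in links:
        crawl_shared(link, maxlevel - 1)
    found.extend(emails)


# Option 2: the caller owns the list and passes it down explicitly.
def crawl_into(page, maxlevel, result):
    if maxlevel == 0 or page not in PAGES:
        return result
    links, emails = PAGES[page]
    for link in links:
        crawl_into(link, maxlevel - 1, result)
    result.extend(emails)
    return result


crawl_shared("home", 2)
collected = crawl_into("home", 2, [])
print(found)
print(collected)
```

Passing the list as an argument is usually easier to reason about than a global, since each top-level call can start from a fresh list.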