
I am working on my first Python project. I want to make a crawler that visits a website and extracts all its links (to a depth of 2). It should store the links in two lists that form a one-to-one register correlating each source link with the target links it contains. Then it should create a CSV file with two columns (Target and Source), so I can open it with Gephi to create a graph of the site's link structure.

The code breaks down at the for loop in the code execution section; it just never stops extracting links (I've tried it on a fairly small blog, and it never ends). What is the problem? How can I solve it?

A few points to consider: I'm really new to programming and Python, so I realize my code must be quite unpythonic. It is also somewhat patchy, since I assembled it while searching for ways to solve my problems as they came up. Sorry about that, and thanks for your help!

import csv
import os
import re
import urllib
from itertools import izip

myurl = raw_input("Introduce URL to crawl => ")
Dominios = myurl.split('.')
Dominio = Dominios[1]

#Variables Block 1
Target = []
Source = []
Estructura = [Target, Source]
links = []

#Variables Block 2
csv_columns = ['Target', 'Source']
csv_data_list = Estructura
currentPath = os.getcwd()
csv_file = "crawleo_%s.csv" % Dominio


# Block 1 => Extract links from a page
def page_crawl(seed):
    try:
        for link in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(seed).read(), re.I):
            Source.append(seed)
            Target.append(link)
            links.append(link)
    except IOError:
        pass

# Block 2 => Write csv file
def WriteListToCSV(csv_file, csv_columns, csv_data_list):
    try:
        with open(csv_file, 'wb') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            writer.writerows(izip(Target, Source))
    except IOError as (errno, strerror):
        print("I/O error({0}): {1}".format(errno, strerror))
    return

# Block 3 => Code execution
page_crawl(myurl)
seed_links = (links)

for sublink in seed_links:        # Problem is with this loop
    page_crawl(sublink)
    seed_sublinks = (links)
## print Estructura               # Line just to check if code was working

#for thirdlinks in seed_sublinks: # Commented out until prior problems are solved
#   page_crawl(thirdlinks)

WriteListToCSV(csv_file, csv_columns, csv_data_list)
Carlos
2 Answers


seed_links and links point to the same list. So when you add elements to links inside the page_crawl function, you are also extending the list that the for loop is iterating over. What you need to do is clone the list when you create seed_links.

This happens because assignment in Python never copies data; it only binds a name to an existing object. That is, multiple variables can point to the same object under different names!

If you want to see this with your own eyes, try print sublink inside the for loop. You will notice that there are more links printed than you initially put in. You will probably also notice that you are trying to loop over the entire web :-)
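For example, here is a tiny sketch of the difference (assuming Python 2, as in the question; the names alias and snapshot are only illustrative):

links = ['a', 'b']
alias = links        # another name for the very same list object
snapshot = links[:]  # an independent copy; list(links) also works
links.append('c')
print alias          # ['a', 'b', 'c'] - grew together with links
print snapshot       # ['a', 'b']      - unaffected

Applied to the question's code, seed_links = links[:] (or list(links)) takes a snapshot of the seeds before the loop starts, so the loop no longer grows while it runs.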

Emil Vikström

I don't immediately see what is wrong. However, there are several remarks to make about this code:

  1. You work with global variables, which is bad practice. It is better to use local variables and pass the results back with return.
  2. Is it possible that a link at the second level refers back to the first level? Then you have a cycle in the data, and the crawl will never terminate unless you guard against it. So you need to keep track of which pages have already been visited.
  3. I would implement this recursively (with the safeguard from point 2), because that makes the code simpler, albeit a little more abstract; a sketch follows below.
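A minimal sketch of what that could look like, reusing the urllib/re approach from the question (the name crawl, the visited set, and the max_depth parameter are illustrative assumptions, not a definitive implementation):

import re
import urllib

def crawl(seed, max_depth, visited=None):
    # Return a list of (source, target) pairs, at most max_depth levels deep.
    if visited is None:
        visited = set()
    if max_depth < 0 or seed in visited:
        return []
    visited.add(seed)  # remember this page so cycles stop here
    pairs = []
    try:
        html = urllib.urlopen(seed).read()
    except IOError:
        return pairs
    for link in re.findall('''href=["']([^"']+)["']''', html, re.I):
        pairs.append((seed, link))
        pairs.extend(crawl(link, max_depth - 1, visited))
    return pairs

With this, something like rows = crawl(myurl, 2) would give you the Source/Target pairs ready for csv.writer.writerows, without any global state.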
Dick Kniep
  • Thank you for your comments, I will pay attention to the way I use variables in the future. – Carlos Jan 19 '16 at 14:29