I am working on my first Python project. I want to make a crawler that visits a website and extracts all of its links (to a depth of 2). It should store the links in two lists that form a one-to-one register correlating each source link with the target links it contains. Then it should create a CSV file with two columns (Target and Source), so I can open it with Gephi to create a graph showing the site's link structure.
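For concreteness, this is roughly what I expect the two lists and the resulting CSV rows to hold for a page with two links (the URLs here are made up):

Source = ['http://example.com', 'http://example.com']
Target = ['http://example.com/about', 'http://example.com/contact']
# CSV rows (Target first, then Source, matching csv_columns below):
# "Target","Source"
# "http://example.com/about","http://example.com"
# "http://example.com/contact","http://example.com"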
The code breaks down at the for loop in the code-execution section (Block 3): it just never stops extracting links. I've tried it on a fairly small blog, and it simply never ends. What is the problem, and how can I solve it?
A few points to consider: I'm really new to programming and Python, so I realize my code must be quite unpythonic. Also, since I pieced it together while searching for ways to solve each problem as it came up, it is somewhat patchy. Sorry about that, and thanks for your help!
import os
import re
import csv
import urllib
from itertools import izip

myurl = raw_input("Introduce URL to crawl => ")
Dominios = myurl.split('.')
Dominio = Dominios[1]
#Variables Block 1
Target = []
Source = []
Estructura = [Target, Source]
links = []
#Variables Block 2
csv_columns = ['Target', 'Source']
csv_data_list = Estructura
currentPath = os.getcwd()
csv_file = "crawleo_%s.csv" % Dominio
# Block 1 => Extract links from a page
def page_crawl(seed):
    try:
        for link in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(seed).read(), re.I):
            Source.append(seed)
            Target.append(link)
            links.append(link)
    except IOError:
        pass
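# Quick check of what the regex above captures (the HTML string below is made up
# for illustration; run it in the interpreter, it is not part of the script):
#   >>> sample = '<a href="http://example.com/about">x</a> <a href="/contact">y</a>'
#   >>> re.findall('''href=["'](.[^"']+)["']''', sample, re.I)
#   ['http://example.com/about', '/contact']
# Note that relative links such as '/contact' are captured as well.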
# Block 2 => Write csv file
def WriteListToCSV(csv_file, csv_columns, csv_data_list):
    try:
        with open(csv_file, 'wb') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            writer.writerows(izip(Target, Source))
    except IOError as (errno, strerror):
        print("I/O error({0}): {1}".format(errno, strerror))
    return
# Block 3 => Code execution
page_crawl(myurl)
seed_links = (links)
for sublink in seed_links:  # Problem is with this loop
    page_crawl(sublink)
seed_sublinks = (links)
## print Estructura  # Line just to check if code was working
#for thirdlinks in seed_sublinks:  # Commented out until prior problems are solved
#    page_crawl(thirdlinks)
WriteListToCSV(csv_file, csv_columns, csv_data_list)
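In case it helps to show what I'm aiming for, here is a minimal sketch of how I imagine Block 3 should behave: take a snapshot of the first-level links before looping and crawl each of them exactly once. The visited set and the list(...) copy are just my own guesses, not something I'm sure is the right fix:

page_crawl(myurl)
first_level = list(links)      # copy the seed's links so later appends don't feed the loop
visited = set([myurl])         # 'visited' is a name I made up for this sketch
for sublink in first_level:
    if sublink not in visited:
        visited.add(sublink)
        page_crawl(sublink)
WriteListToCSV(csv_file, csv_columns, csv_data_list)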