Find all possible links in a website / Screen-Web Scraping with Python

Question

Sort of an open-ended question here. I needed to go across a Job site and search for a Job Description tag and a Skill requirement (I'm done with this). I basically wanted to know, how do I crawl across the site? As in, go from test.com to test.com/a and so on....?? Basically, crawl the page.

This is my code to search within the page. I need to find all the possible such pages in the site and get the link. THIS IS NOT HOMEWORK. I'm just doing this on the side...

import urllib2
import re

html_content = urllib2.urlopen('http://www.ziprecruiter.com/job/Systems-     Engineer/b5452eab/?source=customer-cpc-indeed').read()

matchDescription = re.findall('Bachelor', html_content);
matchSkill = re.findall('VMware', html_content);


print matchDescription
print matchSkill

if ( len(matchDescription) and len(matchSkill) )== 0: 
   print 'I did not find anything'
else:
   print 'My string is in the html'

score 0 · Accepted Answer · answered Apr 01 '13 at 01:20

Consider using Scrapy or some other existing scraping framework. Otherwise, you need to find the necessary links manually using lxml or some other HTML parser and crawl them using some manual mechanism based on urllib or something like that and some data structures to store input and output data.

Find all possible links in a website / Screen-Web Scraping with Python

1 Answers1