I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if an article mentions any external URLs (articles from other domains), I'd like to fetch those as well. In other words, I want to allow the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone let me know if the snippet below is right or wrong?
Any help is greatly appreciated.
Here is my code snippet:
    import tldextract
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class NewsSpider(CrawlSpider):
        name = 'news'
        start_urls = ['https://www.example.com']
        master_domain = tldextract.extract(start_urls[0]).domain
        allowed_domains = ['www.example.com']

        rules = (
            Rule(LinkExtractor(deny=(r'/search', r'showComment=', r'/search/')),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            url = response.url
            master_domain = self.master_domain
            self.logger.info(master_domain)
            current_domain = tldextract.extract(url).domain
            referer = response.request.headers.get('Referer')
            depth = response.meta.get('depth')
            if current_domain == master_domain:
                yield {'url': url,
                       'referer': referer,
                       'depth': depth}
            elif depth < 2:
                yield {'url': url,
                       'referer': referer,
                       'depth': depth}
            else:
                self.logger.debug('external URL at depth 2 or more, skipping')
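For context, the same-domain test in parse_item can be sketched with the standard library alone. This is a simplified stand-in for tldextract (the function names here are my own, and it ignores multi-part suffixes like .co.uk, which is exactly the case tldextract exists to handle):

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Crude stand-in for tldextract.extract(url).domain:
    returns the second-to-last dot-separated label of the host."""
    host = urlparse(url).netloc
    labels = host.split('.')
    return labels[-2] if len(labels) >= 2 else host


def is_internal(url, master_domain):
    # Mirrors the current_domain == master_domain comparison in parse_item
    return registered_domain(url) == master_domain


print(is_internal('https://www.example.com/article/1', 'example'))  # True
print(is_internal('https://news.othersite.org/story', 'example'))   # False
```

Note that Scrapy itself tracks crawl depth in response.meta['depth'] (start pages are depth 0), and the DEPTH_LIMIT setting can cap the depth of the whole crawl rather than checking it per item.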