
I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if any external URLs (articles from other domains) are mentioned in an article, I may want to fetch those URLs as well. In other words, I want to allow the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone let me know if the snippet below is right or wrong?

Any help is greatly appreciated.

Here is my code snippet:

import tldextract
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsSpider(CrawlSpider):
    name = 'news'
    # start_urls need a scheme (http:// or https://)
    start_urls = ['https://www.example.com']
    master_domain = tldextract.extract(start_urls[0]).domain
    allowed_domains = ['www.example.com']

    rules = (
        Rule(LinkExtractor(deny=(r'/search', r'showComment=', r'/search/')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        url = response.url
        master_domain = self.master_domain
        self.logger.info(master_domain)
        current_domain = tldextract.extract(url).domain
        referer = response.request.headers.get('Referer')
        depth = response.meta.get('depth')
        if current_domain == master_domain:
            # internal URL: always keep it
            yield {'url': url,
                   'referer': referer,
                   'depth': depth}
        else:
            # external URL: keep it only if it is close enough to the start URL
            if depth < 2:
                yield {'url': url,
                       'referer': referer,
                       'depth': depth}
            else:
                self.logger.debug('depth %s exceeds the limit, skipping %s', depth, url)

1 Answer


Open your project's settings.py and add:

DEPTH_LIMIT = 2

For more details, see the Scrapy documentation on the DEPTH_LIMIT setting.
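If you prefer to keep the limit with the spider rather than in settings.py, a minimal sketch using Scrapy's custom_settings class attribute (the class and spider name here are placeholders, not taken from your code):

class NewsSpider(CrawlSpider):
    name = 'news'
    # per-spider equivalent of putting DEPTH_LIMIT = 2 in settings.py
    custom_settings = {'DEPTH_LIMIT': 2}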

There is no need to check the domain with

if current_domain == master_domain:

When you set allowed_domains, the spider will automatically follow only links to the domains listed in allowed_domains.
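With DEPTH_LIMIT = 2 in place and allowed_domains doing the filtering, the callback could be reduced to something like the sketch below (untested, assuming you only want the URL, referer and depth of each page):

def parse_item(self, response):
    yield {
        'url': response.url,
        'referer': response.request.headers.get('Referer'),
        'depth': response.meta.get('depth'),
    }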
