
I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if any external URLs (articles from other domains) are mentioned in an article, I may want to fetch those URLs as well. In other words, I want to allow the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone let me know if the snippet below is right or wrong?

Any help is greatly appreciated.

Here is my code snippet:

import tldextract
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsSpider(CrawlSpider):
    name = 'news'
    # start_urls need a scheme (http:// or https://)
    start_urls = ['https://www.example.com']
    master_domain = tldextract.extract(start_urls[0]).domain
    allowed_domains = ['www.example.com']

    rules = (
        Rule(LinkExtractor(deny=(r'/search', r'showComment=', r'/search/')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        url = response.url
        master_domain = self.master_domain
        self.logger.info(master_domain)
        current_domain = tldextract.extract(url).domain
        referer = response.request.headers.get('Referer')
        depth = response.meta.get('depth')
        if current_domain == master_domain:
            # internal URL: always keep it
            yield {'url': url,
                   'referer': referer,
                   'depth': depth}
        else:
            # external URL: keep it only if it is close enough to the start URL
            if depth < 2:
                yield {'url': url,
                       'referer': referer,
                       'depth': depth}
            else:
                self.logger.debug('depth %s exceeds the limit, skipping %s', depth, url)

1 Answer


Open your project's settings.py and add:

DEPTH_LIMIT = 2

For more details, see the Scrapy documentation on the DEPTH_LIMIT setting.
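If you prefer to keep the limit with the spider rather than in settings.py, a minimal sketch using Scrapy's custom_settings class attribute (the class and spider name here are placeholders, not taken from your code):

class NewsSpider(CrawlSpider):
    name = 'news'
    # per-spider equivalent of putting DEPTH_LIMIT = 2 in settings.py
    custom_settings = {'DEPTH_LIMIT': 2}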

There is no need to check the domain with

if current_domain == master_domain:

When you set allowed_domains, the spider will automatically follow only links to the domains listed in allowed_domains.
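With DEPTH_LIMIT = 2 in place and allowed_domains doing the filtering, the callback could be reduced to something like the sketch below (untested, assuming you only want the URL, referer and depth of each page):

def parse_item(self, response):
    yield {
        'url': response.url,
        'referer': response.request.headers.get('Referer'),
        'depth': response.meta.get('depth'),
    }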
