
I'm scraping news.crunchbase.com with Scrapy. The callback for the recursively followed links never fires when I follow the actual link encountered on the page, but it works fine if I crawl some test link instead. I assume the problem is timing, so I want to delay the recursive request.

EDIT: the answer from the linked question does set a global delay, but it doesn't throttle the recursive requests. The recursive link is crawled instantly, as soon as the data has been scraped.

import time

def parse(self, response):
    time.sleep(5)  # attempt to throttle; note that time.sleep() blocks Scrapy's reactor
    for post in response.css('div.herald-posts'):
        article_url = post.css('div.herald-post-thumbnail a::attr(href)').get()

        if article_url is not None:
            print('\nGot article...', article_url, '\n')
            # recursive request into the article page
            yield response.follow(article_url, headers=self.custom_headers, callback=self.parse_article)

        yield {
            'title': post.css('div.herald-post-thumbnail a::attr(title)').get(),
        }
  • Possible duplicate of [Scrapy delay request](https://stackoverflow.com/questions/30404364/scrapy-delay-request) – vezunchik May 06 '19 at 13:55
  • Nope, I've tried the solutions from that thread - they only deal with per-spider delays; there's nothing about delaying recursive requests – maksimKorzh May 06 '19 at 13:56

1 Answer


This was actually enough.

custom_settings = {
    "DOWNLOAD_DELAY": 5,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1
}

Child requests are put into the request queue and processed after their parent requests, so DOWNLOAD_DELAY applies to them as well. If a request targets a different domain, the delay is tracked separately per domain (download slot), so it is not held back by the current domain's delay and may go out immediately.
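For context, here is a minimal sketch of where these settings might live. The spider name, start URL, headers, and parse_article body are assumptions filled in around the question's code, not part of the original answer:

import scrapy


class CrunchbaseNewsSpider(scrapy.Spider):
    # hypothetical name/start URL based on the question; adjust to your project
    name = "crunchbase_news"
    start_urls = ["https://news.crunchbase.com/"]

    # per-spider settings: every request from this spider to the target domain
    # is throttled, including the recursive article requests
    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    custom_headers = {"User-Agent": "Mozilla/5.0"}  # placeholder headers

    def parse(self, response):
        for post in response.css('div.herald-posts'):
            article_url = post.css('div.herald-post-thumbnail a::attr(href)').get()
            if article_url is not None:
                # queued like any other request, so DOWNLOAD_DELAY applies to it
                yield response.follow(article_url,
                                      headers=self.custom_headers,
                                      callback=self.parse_article)
            yield {'title': post.css('div.herald-post-thumbnail a::attr(title)').get()}

    def parse_article(self, response):
        # placeholder callback; the real article parsing isn't shown in the question
        yield {'article_title': response.css('h1::text').get()}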

P.S. It turned out I simply hadn't waited for start_requests(self) to work through the entire list of URLs, so I thought I had been banned. Hope this helps somebody.