I am planning on deploying a Scrapy spider to ScrapingHub and using the schedule feature to run the spider on a daily basis. I know that, by default, Scrapy filters out duplicate requests within a single run. However, I was wondering whether this duplicate URL avoidance is persistent across scheduled starts on ScrapingHub, and whether I can configure it so that Scrapy does not visit the same URLs across its scheduled runs.
1 Answer
DeltaFetch is a Scrapy plugin that stores fingerprints of visited URLs across different spider runs. You can use this plugin for incremental (delta) crawls. Its main purpose is to avoid requesting pages that have already been scraped before, even if that happened in a previous execution. It will only make requests for pages from which no items were extracted before, for URLs in the spider's start_urls attribute, and for requests generated in the spider's start_requests method.
See: https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/
Plugin repository: https://github.com/scrapy-plugins/scrapy-deltafetch
In Scrapinghub's dashboard, you can activate it on the Addons Setup page inside a Scrapy Cloud project. Note that you'll also need to enable the DotScrapy Persistence addon for it to work, since DeltaFetch keeps its fingerprint database inside the .scrapy directory, which would otherwise be discarded between runs.
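If you prefer to configure it in the project itself (for example, to test locally before deploying), the plugin can also be enabled through your project settings. Here is a minimal sketch of settings.py, assuming the scrapy-deltafetch package is installed (pip install scrapy-deltafetch); the optional settings shown in comments are taken from the plugin's README:

```python
# settings.py -- minimal sketch for enabling DeltaFetch in a Scrapy project.

SPIDER_MIDDLEWARES = {
    # Register the DeltaFetch spider middleware.
    'scrapy_deltafetch.DeltaFetch': 100,
}

# Turn the middleware on.
DELTAFETCH_ENABLED = True

# Optional: directory where the fingerprint database is stored.
# On Scrapy Cloud this lives under .scrapy, which is why the
# DotScrapy Persistence addon is needed to keep it between runs.
# DELTAFETCH_DIR = 'deltafetch'

# Optional: set to True (or pass -a deltafetch_reset=1 when running
# the spider) to wipe the stored fingerprints and re-crawl everything once.
# DELTAFETCH_RESET = False
```

With this in place, the middleware records a fingerprint for every request from which an item was scraped and skips those requests in subsequent scheduled runs, as long as the state directory persists between them.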
