Can someone explain to me how the pause/resume feature in Scrapy works? I'm using Scrapy 0.24.5, and the documentation does not provide much detail.
I have the following simple spider:
from scrapy.spider import Spider
from scrapy.http import Request

class SampleSpider(Spider):
    name = 'sample'

    def start_requests(self):
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

    def parse(self, response):
        # Append the URL of every fetched page to a local file
        with open('responses.txt', 'a') as f:
            f.write(response.url + '\n')
I'm running it using:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

spider = SampleSpider()
settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')
settings.set('DOWNLOAD_DELAY', 10)

crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()

log.start()
reactor.run()
As you can see, I enabled the JOBDIR option so that I can save the state of my crawl. I set DOWNLOAD_DELAY to 10 seconds so that I have time to stop the spider before all the requests are processed. I expected that the next time I ran the spider, the requests would not be regenerated, but that is not the case.
I see a folder named requests.queue inside my scrapy_cache folder, but it is always empty.
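I suspect the pending requests are only flushed to requests.queue when the engine shuts down cleanly, so the way I stop the script may matter. To test that, I'm thinking of adding something like the following to the script above, right before reactor.run() (this assumes, and I'm not certain, that Crawler.stop() in 0.24 triggers a graceful engine shutdown):

# Ask the crawler to stop cleanly after 60 seconds.
# Assumption on my part: Crawler.stop() performs a graceful shutdown, giving
# the scheduler a chance to flush its pending requests into requests.queue.
reactor.callLater(60, crawler.stop)

The CLOSESPIDER_TIMEOUT setting looks like it would achieve the same kind of clean close.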
The requests.seen file does record the issued requests (as SHA1 hashes), which is great. However, the next time I run the spider, the requests are regenerated and the (duplicate) SHA1 hashes are appended to the file. I traced this in the Scrapy code, and it looks like RFPDupeFilter opens requests.seen with the 'a+' flag, so the previously saved fingerprints are never read back in (at least that is the behaviour on my Mac OS X, where the read position apparently starts at the end of the file).
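If the read position really is the problem, I'm considering working around it with a dupefilter subclass that explicitly re-reads requests.seen from the start. This is just a sketch on my part (SeekingDupeFilter and the module it lives in are my own names; the scrapy.dupefilter import path is what I see in 0.24):

import os
from scrapy.dupefilter import RFPDupeFilter

class SeekingDupeFilter(RFPDupeFilter):
    """Re-read requests.seen from the beginning, regardless of where
    the 'a+' mode left the read position."""

    def __init__(self, path=None, *args, **kwargs):
        super(SeekingDupeFilter, self).__init__(path, *args, **kwargs)
        if path:
            seen = os.path.join(path, 'requests.seen')
            if os.path.exists(seen):
                with open(seen, 'r') as f:
                    self.fingerprints.update(line.rstrip() for line in f)

I would then point DUPEFILTER_CLASS at it, e.g. settings.set('DUPEFILTER_CLASS', 'scrapyproject.dupefilters.SeekingDupeFilter') (the module path is hypothetical). Is that the right approach, or am I misreading the dupefilter code?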
Finally, regarding spider state, I can see from the Scrapy code that the spider state is saved when the spider is closed and read back when it is opened. However, that does not help if an exception occurs or the machine shuts down. Do I have to save the state periodically myself?
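In the meantime I am thinking of pickling spider.state myself every so often from a small extension hooked to the response_received signal. This is only a sketch (PeriodicSpiderState, the interval, and the spider.state file name are my own choices, not something I found documented):

import os
import pickle

from scrapy import signals

class PeriodicSpiderState(object):
    """Pickle spider.state every N responses so an unclean shutdown loses
    at most the last N responses' worth of state (hypothetical helper)."""

    def __init__(self, jobdir, interval=100):
        self.jobdir = jobdir
        self.interval = interval
        self.seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.get('JOBDIR'))
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        self.seen += 1
        if self.jobdir and self.seen % self.interval == 0:
            # I assume 'spider.state' is the file the built-in SpiderState
            # extension reads on resume; adjust the name if that is wrong.
            with open(os.path.join(self.jobdir, 'spider.state'), 'wb') as f:
                pickle.dump(getattr(spider, 'state', {}), f, protocol=2)

It would be enabled through the EXTENSIONS setting, e.g. settings.set('EXTENSIONS', {'scrapyproject.extensions.PeriodicSpiderState': 500}) (again, the module path is hypothetical). Is something like this necessary, or does Scrapy already handle unclean shutdowns?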
The main question I have here is: what is the common practice for using Scrapy when you expect the crawl to stop and resume multiple times (e.g., when crawling a very big website)?