
Can someone explain to me how the pause/resume feature in Scrapy works?

The version of Scrapy that I'm using is 0.24.5.

The documentation does not provide much detail.

I have the following simple spider:

from scrapy import Spider, Request


class SampleSpider(Spider):
    name = 'sample'

    def start_requests(self):
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

    def parse(self, response):
        with open('responses.txt', 'a') as f:
            f.write(response.url + '\n')

I'm running it using:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

spider = SampleSpider()
settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')
settings.set('DOWNLOAD_DELAY', 10)
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I enabled the JOBDIR option so that I can save the state of my crawl.

I set the DOWNLOAD_DELAY to 10 seconds so that I can stop the spider before all the requests are processed. I would have expected that the next time I run the spider, the requests would not be regenerated. That is not the case.

I see a folder named requests.queue inside my scrapy_cache folder. However, it is always empty.

It looks like the requests.seen file is saving the issued requests (using SHA1 hashes) which is great. However, the next time I run the spider, the requests are regenerated and the (duplicate) SHA1 hashes are added to the file. I tracked this issue in the Scrapy code and it looks like the RFPDupeFilter opens the requests.seen file with an 'a+' flag. So it will always discard the previous values in the file (at least that is the behavior on my Mac OS X).
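
(A workaround I'm considering is sketched below: subclass RFPDupeFilter so that it seeks back to the beginning of requests.seen and reloads the stored fingerprints. This is only a rough sketch; the subclass name and the module path in the setting are made up.)

# Rough sketch of a workaround, not an official Scrapy recipe: reload the
# fingerprints from requests.seen after seeking back to the start of the file.
from scrapy.dupefilter import RFPDupeFilter  # 'scrapy.dupefilters' in newer Scrapy versions


class SeekingDupeFilter(RFPDupeFilter):
    def __init__(self, path=None, *args, **kwargs):
        super(SeekingDupeFilter, self).__init__(path, *args, **kwargs)
        if path and self.file:
            self.file.seek(0)
            self.fingerprints.update(line.rstrip() for line in self.file)

# enabled via the DUPEFILTER_CLASS setting, e.g.:
# DUPEFILTER_CLASS = 'scrapyproject.dupefilters.SeekingDupeFilter'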

Finally, regarding spider state, I can see from the Scrapy code that the spider state is saved when the spider is closed and read back when it is opened. However, that is not very helpful if an exception occurs or the machine shuts down. Do I have to save the state periodically myself?
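
(If periodic saving is indeed required, what I have in mind is something like the extension sketched below, which pickles spider.state every so often. This is only a sketch; the extension name and the interval are made up, and the file name is meant to mirror the one the built-in SpiderState extension reads on resume, so verify it against your Scrapy version.)

# Sketch only: periodically dump spider.state so a hard crash does not lose it.
import os
import pickle

from scrapy import signals


class PeriodicStateSaver(object):

    def __init__(self, jobdir, interval=100):
        # intended to overwrite the file the built-in SpiderState extension
        # loads on resume (verify the name against your Scrapy version)
        self.statefile = os.path.join(jobdir, 'spider.state')
        self.interval = interval
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.get('JOBDIR'))
        # dump the state after every `interval` scraped items
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        self.count += 1
        if self.count % self.interval == 0:
            with open(self.statefile, 'wb') as f:
                pickle.dump(getattr(spider, 'state', {}), f)

# enabled via the EXTENSIONS setting, e.g.:
# EXTENSIONS = {'scrapyproject.extensions.PeriodicStateSaver': 500}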

The main question I have here is: What's the common practice to use Scrapy while expecting that the crawl will stop/resume multiple times (e.g., when crawling a very big website)?

Jithin
Abdul
  • It looks like you run Scrapy inside a Python script. Can you stop the reactor/Scrapy periodically? From my past experience, `reactor.run()` always blocks the script, so I couldn't call `reactor.stop()`. I thought about running Scrapy in another thread and sending a terminate signal to that thread, but I haven't tried it. – Hieu Jul 19 '16 at 04:48

3 Answers


To be able to pause and resume a Scrapy crawl, start it with this command:

scrapy crawl somespider --set JOBDIR=crawl1

To stop the crawl, press Ctrl-C, but press it only once and wait for Scrapy to stop; if you press Ctrl-C twice, the resume won't work properly.

Then you can resume the crawl by running the same command again:

scrapy crawl somespider --set JOBDIR=crawl1
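
If you launch Scrapy from a script rather than the command line (as in the question), roughly the same thing can presumably be done by putting JOBDIR into the settings you pass to the crawler. A minimal sketch, assuming Scrapy 1.x and its CrawlerProcess API, reusing the 'crawl1' directory and 'somespider' name from above:

# Sketch only: pause/resume from a script by pointing JOBDIR at the same
# directory on every run (assumes Scrapy 1.x with CrawlerProcess).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('JOBDIR', 'crawl1')   # same directory across runs keeps the request queue

process = CrawlerProcess(settings)
process.crawl('somespider')        # the spider's name attribute, as in `scrapy crawl somespider`
process.start()                    # Ctrl-C once shuts down gracefully; rerun the script to resume
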
Maryam Homayouni
  • The documentation does not mention anything about the number of signals. Could you please link the reference? – olegario Jul 20 '18 at 13:47
  • This method only records the queued requests, but if you want to issue other requests based on the output of one request, as I did, you will lose all that information and will essentially have to start the spider again. – Anmol Deep Mar 02 '23 at 10:15

The version of Scrapy that I'm using is 1.1.0.

You need to set the JOBDIR in settings.py:

JOBDIR = 'PROJECT_DIR'

After stopping the spider with Ctrl-C, you can run it again to continue scraping the rest.

It should work after that.

Boseam
  • This setting can also be set in the custom_settings inside of a python script: `class MyCoolSpider(scrapy.Spider): name = 'mycool-spider' custom_settings = { 'JOBDIR': f'PROJECT_DIR_{datetime.now().strftime("%Y%m%d%H%M%S")}' }` – jeffsdata Nov 18 '22 at 14:52

Re: "What's the common practice to use Scrapy while expecting that the crawl will stop/resume multiple times (e.g., when crawling a very big website)?"

If you don't want to use Scrapy's pause/resume, you can always serialize your requests yourself. I'll give an example below.

Say you first collect 10000 URLs and then want to scrape them sequentially in a new crawler. You can serialize these URLs based on whatever rules you like and then read them back with the csv module in the spider:

import csv

with open('your10000_urls.csv', 'r') as f:
    # take the first column of every row as a start URL
    start_urls = [row[0] for row in csv.reader(f) if row]

Then you can keep track of these requests and drop the ones that have already been made (see the sketch below). You may also want to store your data in a database; it makes life much easier.
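
A minimal sketch of that bookkeeping, assuming you append each finished URL to a plain text file (the done_urls.txt name and the helper function are made up for illustration):

# Sketch only: skip URLs that a previous run already finished. Append to
# done_urls.txt from parse() after each successful scrape so the next run
# picks up where this one stopped.
import csv
import os

def load_pending_urls(csv_path='your10000_urls.csv', done_path='done_urls.txt'):
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = set(line.strip() for line in f)
    with open(csv_path) as f:
        return [row[0] for row in csv.reader(f) if row and row[0] not in done]

start_urls = load_pending_urls()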

Hope it helps.

kevin
  • You have accurately captured the essence of the question. Thanks! Could you please give me some idea about serializing ASP.NET FormRequests with the __EVENT_STATE etc. variables? And if possible, please elaborate on the role of the yield keyword in this. – Anmol Deep Mar 02 '23 at 10:11