
How can I write a Scrapy spider that runs "forever", so that it starts again whenever it reaches `def closed(self, spider):`?

This is the function that is called when the spider finishes. I tested it by printing some text, and every time the spider ends I see that text.

But how can I start the spider again after that?

class Spider(scrapy.Spider):
    def start_requests(self):
        Spidercode...

    def closed(self, spider):
        print('END')

The spider starts every round with `start_requests` and ends with `closed()`.

  • Let's see your code for creating and starting the spider. I assume you'll want to put that code in a loop somehow. – CryptoFool Jan 02 '21 at 16:13
  • @Steve my idea was to jump from `closed` back to `start_requests` again. – togmer Jan 02 '21 at 16:17
  • Why not simply call `start_requests` in your `closed`? You may also want to reset your object state first, if there is any. – Lior Cohen Jan 02 '21 at 16:17
  • Does this answer your question? [How to build a web crawler based on Scrapy to run forever?](https://stackoverflow.com/questions/2350049/how-to-build-a-web-crawler-based-on-scrapy-to-run-forever) – Gallaecio Feb 22 '21 at 03:40
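
A minimal sketch of the "run it in a loop" idea from the comments, assuming a hypothetical placeholder spider `MySpider`: Scrapy's `CrawlerRunner` returns a Deferred from each `crawl()` call, and an `inlineCallbacks` coroutine can wait for that Deferred and then start the next run inside the Twisted reactor.

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor, defer
from twisted.internet.task import deferLater

class MySpider(scrapy.Spider):
    """Hypothetical placeholder spider; substitute your own."""
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

configure_logging()
runner = CrawlerRunner(settings={})

@defer.inlineCallbacks
def loop_crawl():
    while True:
        yield runner.crawl(MySpider)                # run one full crawl to completion
        yield deferLater(reactor, 5, lambda: None)  # pause 5 seconds before restarting

loop_crawl()
reactor.run()  # blocks; the coroutine above keeps scheduling new crawls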

1 Answer

import scrapy
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from twisted.internet.task import deferLater

...  # MySpider (your scrapy.Spider subclass) is defined here

runner = CrawlerProcess(settings={})

def sleep(*args, seconds=0):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

def crash(failure):
    """Errback: log the failure so the chain does not stop silently."""
    print(failure)

def crawl(result):
    d = runner.crawl(MySpider)
    d.addCallback(lambda results: print('waiting 0 seconds before restart...'))
    d.addErrback(crash)              # <-- handle errors from the crawl
    d.addCallback(sleep, seconds=0)  # non-blocking wait before the next run
    d.addCallback(crawl)             # then schedule the next crawl
    return d

crawl(None)
runner.start()
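
Each `runner.crawl()` call returns a Twisted Deferred that fires when that crawl finishes; the callbacks then sleep for the given number of seconds and call `crawl` again, so a new run is scheduled every time the previous one ends. `runner.start()` starts the Twisted reactor and blocks, so the script keeps running while the crawls chain into each other.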

