
I want to run two spiders in a coordinated fashion. The first spider will scrape a website and produce URLs, and the second one will consume those addresses. I can't wait for the first spider to finish before launching the second one, because the website changes very fast and the URLs produced by the first spider need to be scraped right away. A very simple architecture is shown below. Currently, I am using Scrapy separately for each scraping job. Any idea how I can do this? Each spider behaves differently (has different settings) and does a different job. It would be nice to have them on different machines (distributed).

[architecture diagram: the first spider scrapes the website and passes the URLs it finds to the second spider]

Bociek

2 Answers


One idea (maybe it's a bad idea):

Run the 1st spider so that it saves the scraped URLs into a DB (a pipeline sketch for this follows below).

Run the 2nd spider separately, along these lines:

# inside the 2nd spider's class; Request comes from `from scrapy import Request`
def start_requests(self):
    # Keep polling the table that the 1st spider fills
    while True:
        for url in get_unscraped_urls():  # hypothetical helper, e.g. SELECT url FROM 1st_spider_urls
            yield Request(url)

        if first_spider_finished():  # hypothetical check set by the 1st spider when it is done
            break

It will keep getting URLs from the table and scraping them immediately.
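
For the 1st spider's side, a minimal sketch of an item pipeline that stores the scraped URLs, assuming SQLite and an item with a url field; the database file, table, and column names are illustrative:

# pipelines.py -- stores every URL the 1st spider yields as an item
import sqlite3

class UrlStoragePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("urls.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS first_spider_urls "
            "(url TEXT PRIMARY KEY, scraped INTEGER DEFAULT 0)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE skips URLs that were already stored
        self.conn.execute(
            "INSERT OR IGNORE INTO first_spider_urls (url) VALUES (?)",
            (item["url"],),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

You would enable this for the 1st spider only, e.g. through the ITEM_PIPELINES setting in that spider's custom_settings.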

Umair Ayub
  • How do I get the signal that the first spider has finished in a distributed system? They would have to be run in the same crawler process to achieve that. – WoofDoggy Mar 14 '19 at 11:23

Your two spiders can still be independent. They do not need to be coordinated, and they do not need to communicate with each other. Both just need access to a central database.

Spider1 is only responsible for populating a database table with URLs, and Spider2 is just responsible for reading from it (and maybe updating the rows if you want to keep track). Both spiders can start/stop independently. If Spider1 stops, Spider2 can still keep going as long as there are URLs.
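
A minimal sketch of Spider2's side, reusing the illustrative SQLite table from the sketch in the other answer (file, table, and column names are hypothetical):

import sqlite3
import scrapy

class Spider2(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        conn = sqlite3.connect("urls.db")
        # Grab a batch of rows that have not been processed yet
        rows = conn.execute(
            "SELECT url FROM first_spider_urls WHERE scraped = 0 LIMIT 100"
        ).fetchall()
        # Mark them so the same URL is not handled twice (the "keep track" part)
        conn.executemany(
            "UPDATE first_spider_urls SET scraped = 1 WHERE url = ?", rows
        )
        conn.commit()
        conn.close()
        for (url,) in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Whatever Spider2 actually extracts goes here
        pass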

In the case where there are currently no more URLs for Spider2, you can keep it alive by connecting a spider_idle signal handler that raises a DontCloseSpider exception (see the documentation). At that point you can also fetch a new batch of URLs from the database and crawl them (there is an example of crawling from a signal handler).
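
A rough sketch of that pattern; the fetch_new_urls_from_db helper is hypothetical, and the exact engine.crawl signature depends on your Scrapy version (older versions also take the spider as a second argument):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class Spider2(scrapy.Spider):
    name = "spider2"
    # start_requests / the initial batch is omitted for brevity

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def handle_idle(self):
        # Called when the spider has no more pending requests
        urls = fetch_new_urls_from_db()  # hypothetical helper reading the shared table
        for url in urls:
            # Schedule the new requests directly on the engine
            # (older Scrapy versions need self.crawler.engine.crawl(request, self))
            self.crawler.engine.crawl(scrapy.Request(url, callback=self.parse))
        # Prevent the spider from closing while it waits for more URLs
        raise DontCloseSpider

    def parse(self, response):
        pass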

Alternatively, you could just use something like cron to schedule an execution of Spider2 every few minutes. Then you don't have to worry about keeping it alive.
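
For example, a crontab entry along these lines would start Spider2 every five minutes (the paths are placeholders for your own project layout):

*/5 * * * * cd /path/to/scrapy_project && scrapy crawl spider2 >> /var/log/spider2.log 2>&1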

malberts