
I am using the Scrapy framework to make spiders crawl some webpages. Basically, what I want is to scrape web pages and save them to a database. I have one spider per webpage, but I am having trouble running those spiders so that one spider starts crawling exactly after another finishes. How can that be achieved? Is scrapyd the solution?

Nabin

1 Answer


scrapyd is indeed a good way to go. Its max_proc or max_proc_per_cpu configuration options can be used to restrict the number of parallel spiders; you then schedule spiders through the scrapyd REST API, for example:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
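
For the sequential behaviour asked about in the question, a minimal scrapyd.conf sketch could look like this (the file location and the exact values are assumptions; max_proc = 1 tells scrapyd to run at most one crawl process at a time, so further scheduled spiders simply wait in the queue):

# scrapyd.conf (for example /etc/scrapyd/scrapyd.conf, or next to your scrapy.cfg)
[scrapyd]
# run at most one crawl process at a time; queued jobs wait their turn
max_proc = 1
# only consulted when max_proc is 0; kept low here as a safeguard
max_proc_per_cpu = 1

With that in place, two spiders such as spider1 and spider2 can be scheduled back to back, and scrapyd will start the second only after the first finishes:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider1
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
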
Guy Gavriely
  • I have two spiders: spider1 and spider2. Now how do I start doing it? – Nabin Feb 11 '14 at 06:45
  • But "scrapy deploy" doesn't work. It says "Usage ===== scrapy deploy [options] [ [target] | -l | -L ] deploy: error: Unknown target: default " – Nabin Feb 11 '14 at 07:54
  • And where is schedule.json file? Or do I have to create one? @Guy Gavriely – Nabin Feb 11 '14 at 08:55
  • A browse of the rest of the scrapyd documentation may prove useful: http://scrapyd.readthedocs.org/en/latest/ – Talvalin Feb 11 '14 at 12:14