I am using the Scrapy framework to make spiders crawl some webpages. Basically, what I want is to scrape web pages and save them to a database. I have one spider per webpage. But I am having trouble running those spiders so that one spider starts to crawl exactly after another spider finishes crawling. How can that be achieved? Is scrapyd the solution?
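(For illustration only, a minimal sketch of the kind of setup described; the spider names spider1 and spider2 come from the comments below, and the URL and parsing logic are placeholders, not the actual code.)

import scrapy

class Spider1(scrapy.Spider):
    # One spider per webpage; a second class named "spider2" would look the
    # same with its own start URL.
    name = "spider1"
    start_urls = ["http://example.com/page1"]

    def parse(self, response):
        # Yield an item for the pipeline that writes to the database.
        yield {"url": response.url, "title": response.css("title::text").get()}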
1 Answer
scrapyd is indeed a good way to go. The max_proc or max_proc_per_cpu configuration setting can be used to restrict the number of parallel spiders; you then schedule spiders through the scrapyd REST API, for example:
$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
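
To make the crawls strictly sequential, one option (a sketch, assuming a default scrapyd install) is to cap scrapyd at a single crawl process in scrapyd.conf, so queued jobs run one after the other:

# scrapyd.conf -- allow only one crawl process at a time
[scrapyd]
max_proc = 1

If you prefer to drive the ordering from a script, something along these lines should work. This is a minimal sketch using the requests library; the project name myproject and the spider names spider1 and spider2 are taken from the question and comments above:

import time
import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"
SPIDERS = ["spider1", "spider2"]  # one spider per webpage

def wait_until_finished(jobid):
    # Poll scrapyd's listjobs.json until this job shows up under "finished".
    while True:
        jobs = requests.get(SCRAPYD + "/listjobs.json",
                            params={"project": PROJECT}).json()
        if any(job["id"] == jobid for job in jobs.get("finished", [])):
            return
        time.sleep(5)

for spider in SPIDERS:
    # schedule.json returns a job id that can be tracked via listjobs.json
    resp = requests.post(SCRAPYD + "/schedule.json",
                         data={"project": PROJECT, "spider": spider}).json()
    wait_until_finished(resp["jobid"])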

Guy Gavriely
- I have two spiders: spider1 and spider2. Now how do I start? – Nabin Feb 11 '14 at 06:45
- But "scrapy deploy" doesn't work. It fails with: "Usage ===== scrapy deploy [options] [ [target] | -l | -L ] deploy: error: Unknown target: default" – Nabin Feb 11 '14 at 07:54
- And where is the schedule.json file? Or do I have to create one? @Guy Gavriely – Nabin Feb 11 '14 at 08:55
- A browse of the rest of the scrapyd documentation may prove useful: http://scrapyd.readthedocs.org/en/latest/ – Talvalin Feb 11 '14 at 12:14