
I am building a project where I need a web crawler which crawls a list of different webpages. This list can change at any time. How is this best implemented with scrapy? Should I create one spider for all websites or dynamically create spiders?

I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.

MaxLudv
  • Parsing logic for all of these websites is the same, right? – alecxe Jul 02 '13 at 15:47
  • The parsing logic is the same: I have a number of XPaths in the database. The easy way is just to throw everything into one spider and do all parsing in the parse callback (with a call to the database). – MaxLudv Jul 10 '13 at 11:34
  • Yup, sounds like one spider with an overridden `start_requests` method. – alecxe Jul 10 '13 at 11:36

1 Answer


If the parsing logic is the same, there are two approaches:

  1. For a large number of webpages, build the list of URLs, read it when the crawl starts (for example in the `start_requests` method or in the spider's constructor), and assign it to `start_urls` (see the first sketch below this list).
  2. Pass the webpage link to the spider as a command-line argument, then read that argument in `start_requests` or in the constructor in the same way and assign it to `start_urls` (see the second sketch further down).
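
A minimal sketch of the first approach, assuming the URL list lives somewhere you can query at crawl time; `load_start_urls()` below is a placeholder for your own database query or file read:

    import scrapy

    def load_start_urls():
        # Placeholder: replace with your own database query or file read.
        return ["http://example.com", "http://example.org"]

    class ListSpider(scrapy.Spider):
        name = "list_spider"

        def start_requests(self):
            # Re-read the URL list each time the crawl starts, so changes
            # to the list are picked up automatically.
            for url in load_start_urls():
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Shared parsing logic for every site, e.g. applying the XPaths
            # stored alongside each URL in the database.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}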

Passing parameters in Scrapy:

    scrapy crawl spider_name -a start_url=your_url

When scheduling through Scrapyd, replace `-a` with `-d`.
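
A minimal sketch of the second approach; the `start_url` name matches the `-a start_url=...` argument above, and the constructor simply assigns it to `start_urls` so Scrapy's default `start_requests` uses it:

    import scrapy

    class SingleSiteSpider(scrapy.Spider):
        name = "spider_name"

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The value passed with -a start_url=... (or -d with Scrapyd)
            # arrives here as a keyword argument.
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            # Same shared parsing logic as in the first sketch.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}

The spider name in the crawl command has to match the spider's `name` attribute; everything else works the same as in the first approach.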

Tasawer Nawaz