
I am building a project where I need a web crawler which crawls a list of different webpages. This list can change at any time. How is this best implemented with scrapy? Should I create one spider for all websites or dynamically create spiders?

I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.

MaxLudv
  • Parsing logic for all of these websites is the same, right? – alecxe Jul 02 '13 at 15:47
  • The parsing logic is the same: I have a number of XPaths in the database. The easy way is just to throw everything into one spider and do all parsing in the parse callback (with a call to the database). – MaxLudv Jul 10 '13 at 11:34
  • Yup, sounds like one spider with an overridden `start_requests` method. – alecxe Jul 10 '13 at 11:36

1 Answer


If the parsing logic is the same, there are two approaches:

  1. For a large number of webpages, build the list of URLs, read it when the crawl starts (for example in the `start_requests` method or in the spider's constructor), and assign it to `start_urls` (see the first sketch below this list).
  2. Pass the webpage link to the spider as a command-line argument, then read that argument in `start_requests` or in the constructor in the same way and assign it to `start_urls` (see the second sketch further down).
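
A minimal sketch of the first approach, assuming the URL list lives somewhere you can query at crawl time; `load_start_urls()` below is a placeholder for your own database query or file read:

    import scrapy

    def load_start_urls():
        # Placeholder: replace with your own database query or file read.
        return ["http://example.com", "http://example.org"]

    class ListSpider(scrapy.Spider):
        name = "list_spider"

        def start_requests(self):
            # Re-read the URL list each time the crawl starts, so changes
            # to the list are picked up automatically.
            for url in load_start_urls():
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Shared parsing logic for every site, e.g. applying the XPaths
            # stored alongside each URL in the database.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}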

Passing parameters in Scrapy:

    scrapy crawl spider_name -a start_url=your_url

When scheduling through Scrapyd, replace `-a` with `-d`.
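
A minimal sketch of the second approach; the `start_url` name matches the `-a start_url=...` argument above, and the constructor simply assigns it to `start_urls` so Scrapy's default `start_requests` uses it:

    import scrapy

    class SingleSiteSpider(scrapy.Spider):
        name = "spider_name"

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The value passed with -a start_url=... (or -d with Scrapyd)
            # arrives here as a keyword argument.
            self.start_urls = [start_url] if start_url else []

        def parse(self, response):
            # Same shared parsing logic as in the first sketch.
            yield {"url": response.url, "title": response.xpath("//title/text()").get()}

The spider name in the crawl command has to match the spider's `name` attribute; everything else works the same as in the first approach.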

Tasawer Nawaz