
I have a project in which I have to crawl a great number of different sites. All of these sites can be crawled with the same spider, as I don't need to extract items from their body pages. The approach I thought of is to parametrize the domain to be crawled in the spider file and call the scrapy crawl command passing the domain and start urls as parameters (a rough sketch of what I mean is below, after the questions), so I could avoid generating a separate spider for every site (the list of sites will grow over time). The idea is to deploy it to a server running scrapyd, so several questions come to mind:

  • Is this the best approach I can take?
  • If so, is there any concurrency problem if I schedule several times the same spider with different arguments passed?
  • If this is not the best approach, and it is better to create a single spider per site... I will have to update the project frequently. Does a project update affect running spiders?
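
Something along these lines is what I have in mind (the spider, argument, and file names are just placeholders to illustrate the idea):

# generic_spider.py -- one parametrized spider reused for every site
import scrapy
from urlparse import urlparse  # Python 3: from urllib.parse import urlparse


class GenericSpider(scrapy.Spider):
    name = 'generic'

    def __init__(self, start_urls='', *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        # the urls arrive as a single comma-separated string from -a / scrapyd
        self.start_urls = [u for u in start_urls.split(',') if u]
        self.allowed_domains = [urlparse(u).netloc for u in self.start_urls]

    def parse(self, response):
        # no items are extracted from body pages; crawling logic would go here
        pass

which would be run as scrapy crawl generic -a start_urls=http://example.com,http://example.org.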

1 Answer


Spider design

There are two approaches to building a domain spider:

  1. sending a list of urls to a single spider as an argument
  2. running multiple instances of the same spider, each with a different start_url as an argument

The first approach is the most straightforward and the easiest to test (you can run it with scrapy crawl), and it is fine in many cases. The second approach is less convenient to use but easier to code:

  1. sending a list of urls to a single spider as an argument:
    • minimal CPU footprint: launches a single process for all urls
    • user friendly: can be run with scrapy crawl or through scrapyd
    • harder to debug: no domain restriction
  2. running one instance per start_url:
    • heavy resource footprint: launches a dedicated process for each url
    • not user friendly: you need to create an external script to launch the spiders and feed them urls (a sketch follows the recommendation below)
    • easier to debug: the code runs one domain at a time

In either case, allowed_domains can be derived at runtime from the start_urls that were passed in:

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse
...
class .....(Spider):
    def __init__(self, *args, **kwargs):
        ...
        self.start_urls = ....
        ...
        # restrict the crawl to the domains of the urls received at runtime
        self.allowed_domains = map(lambda x: urlparse(x).netloc, self.start_urls)
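
With option 1, the whole url list travels as a single spider argument, so the same spider can be scheduled on scrapyd without any extra tooling (any extra parameter of schedule.json is forwarded to the spider as an argument). The project and spider names below are assumptions:

curl http://localhost:6800/schedule.json \
    -d project=myproject \
    -d spider=myspider \
    -d start_urls="http://example.com,http://example.org"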

I would only recommend the second approach if you run into programming challenges with the first. Otherwise, stick to option 1 for the sake of simplicity and scalability.
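
If you do go with option 2, the external launcher can be as small as one scheduling request per url. A sketch, assuming the project is called myproject and the spider myspider:

# launch_spiders.py -- schedule one scrapyd job per start url (a sketch)
import requests

SCRAPYD_URL = 'http://localhost:6800/schedule.json'
start_urls = ['http://example.com', 'http://example.org']

for url in start_urls:
    # every extra parameter is forwarded to the spider as an argument
    response = requests.post(SCRAPYD_URL, data={
        'project': 'myproject',
        'spider': 'myspider',
        'start_url': url,
    })
    print(response.json())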

Concurrency

You can control concurrency through settings.py with the CONCURRENT_REQUESTS_PER_DOMAIN setting.
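
For example (the values are arbitrary, tune them to your crawl):

# settings.py
CONCURRENT_REQUESTS = 32             # global limit across all running requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # limit per individual domain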

Project update

Both architectures require writing only one spider. You instantiate it once (option 1) or once per url (option 2). You never need to write multiple spiders.

FYI: updating the project does not affect running spiders.

  • Why is the first option a better choice? I can't see the advantages. About the project update, I will remake the question, and then I will answer myself. I wanted to know if the running scrapy spiders under a scrapyd service would be affected if I made a scrapyd deploy while they were running. And the answer is no, they keep running without further problems. – Bernardo Botella Jul 07 '14 at 11:28
  • I can't give you a positive vote as I haven't enough reputation. – Bernardo Botella Jul 07 '14 at 11:32
  • Thanks for your feedback. I believe I clarified all your questions. What do you think? – Frederic Bazin Jul 08 '14 at 13:31
  • I don't understand how the second option is possible since crawlers use spider classes (not instances) to start. So you have to make a class for every domain you want to crawl. I know it doesn't make sense but I can't find another way to do it – Mr Alexander Oct 14 '16 at 12:29
  • @AlexPatchanka, Crawler actually uses instances (the parse method etc. are instance methods, not class methods). Whether you instantiate multiple spiders within a single process or in individual processes is another challenge. I think you are confused by the member `allowed_domains`, which can be set at runtime (based on input) rather than in the source code. – Frederic Bazin Oct 16 '16 at 06:34
  • @FredericBazin How can I set `allowed_domains` at runtime and not in the source code? – Mr Alexander Oct 17 '16 at 09:28