
I have a project in which I have to crawl a great number of different sites. All of these sites can be crawled with the same spider, as I don't need to extract items from their body pages. The approach I thought of is to parametrize the domain to be crawled in the spider file and call the scrapy crawl command passing the domain and start urls as parameters (a rough sketch of what I mean is below, after the questions), so I could avoid generating a separate spider for every site (the list of sites will grow over time). The idea is to deploy it to a server running scrapyd, so several questions come to mind:

  • Is this the best approach I can take?
  • If so, is there any concurrency problem if I schedule several times the same spider with different arguments passed?
  • If this is not the best approach, and it is better to create a single spider per site... I will have to update the project frequently. Does a project update affect running spiders?
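
Something along these lines is what I have in mind (the spider, argument, and file names are just placeholders to illustrate the idea):

# generic_spider.py -- one parametrized spider reused for every site
import scrapy
from urlparse import urlparse  # Python 3: from urllib.parse import urlparse


class GenericSpider(scrapy.Spider):
    name = 'generic'

    def __init__(self, start_urls='', *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        # the urls arrive as a single comma-separated string from -a / scrapyd
        self.start_urls = [u for u in start_urls.split(',') if u]
        self.allowed_domains = [urlparse(u).netloc for u in self.start_urls]

    def parse(self, response):
        # no items are extracted from body pages; crawling logic would go here
        pass

which would be run as scrapy crawl generic -a start_urls=http://example.com,http://example.org.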

1 Answer


Spider design

There are two approaches to building a domain spider:

  1. sending a list of urls to a single spider as an argument
  2. running multiple instances of the same spider, each with a different start_url as an argument

The first approach is the most straightforward and the easiest to test (you can run it with scrapy crawl), and it is fine in many cases. The second approach is less convenient to use but easier to code:

  1. sending a list of urls to a single spider as an argument:
    • minimal CPU footprint: launches a single process for all urls
    • user friendly: can be run with scrapy crawl or through scrapyd
    • harder to debug: no domain restriction
  2. running one instance per start_url:
    • heavy resource footprint: launches a dedicated process for each url
    • not user friendly: you need to create an external script to launch the spiders and feed them urls (a sketch follows the recommendation below)
    • easier to debug: the code runs one domain at a time

In either case, allowed_domains can be derived at runtime from the start_urls that were passed in:

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse
...
class .....(Spider):
    def __init__(self, *args, **kwargs):
        ...
        self.start_urls = ....
        ...
        # restrict the crawl to the domains of the urls received at runtime
        self.allowed_domains = map(lambda x: urlparse(x).netloc, self.start_urls)
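
With option 1, the whole url list travels as a single spider argument, so the same spider can be scheduled on scrapyd without any extra tooling (any extra parameter of schedule.json is forwarded to the spider as an argument). The project and spider names below are assumptions:

curl http://localhost:6800/schedule.json \
    -d project=myproject \
    -d spider=myspider \
    -d start_urls="http://example.com,http://example.org"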

I would only recommend the second approach if you run into programming challenges with the first. Otherwise, stick to option 1 for the sake of simplicity and scalability.
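
If you do go with option 2, the external launcher can be as small as one scheduling request per url. A sketch, assuming the project is called myproject and the spider myspider:

# launch_spiders.py -- schedule one scrapyd job per start url (a sketch)
import requests

SCRAPYD_URL = 'http://localhost:6800/schedule.json'
start_urls = ['http://example.com', 'http://example.org']

for url in start_urls:
    # every extra parameter is forwarded to the spider as an argument
    response = requests.post(SCRAPYD_URL, data={
        'project': 'myproject',
        'spider': 'myspider',
        'start_url': url,
    })
    print(response.json())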

Concurrency

You can control concurrency through settings.py with the CONCURRENT_REQUESTS_PER_DOMAIN setting.
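
For example (the values are arbitrary, tune them to your crawl):

# settings.py
CONCURRENT_REQUESTS = 32             # global limit across all running requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # limit per individual domain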

Project update

Both architectures require writing only one spider. You instantiate it once (option 1) or once per url (option 2). You never need to write multiple spiders.

FYI: updating the project does not affect running spiders.

  • Why is the first option a better choice? I can't see the advantages. About the project update, I will remake the question, and then I will answer myself. I wanted to know if the running scrapy spiders under a scrapyd service would be affected if I made a scrapyd deploy while they were running. And the answer is no, they keep running without further problems. – Bernardo Botella Jul 07 '14 at 11:28
  • I can't give you a positive vote as I haven't enough reputation. – Bernardo Botella Jul 07 '14 at 11:32
  • Thanks for your feedback. I believe I clarified all your questions. What do you think? – Frederic Bazin Jul 08 '14 at 13:31
  • I don't understand how the second option is possible since crawlers use spider classes (not instances) to start. So you have to make a class for every domain you want to crawl. I know it doesn't make sense but I can't find another way to do it – Mr Alexander Oct 14 '16 at 12:29
  • @AlexPatchanka, Crawler actually uses instances (the parse method etc. are instance methods, not class methods). Whether you instantiate multiple spiders within a single process or in individual processes is another challenge. I think you are confused by the member `allowed_domains`, which can be set at runtime (based on input) rather than in the source code. – Frederic Bazin Oct 16 '16 at 06:34
  • @FredericBazin How can I set `allowed_domains` at runtime and not in the source code? – Mr Alexander Oct 17 '16 at 09:28