I'm used to running spiders one at a time, because we mostly work with `scrapy crawl` and on Scrapinghub, but I know that one can run multiple spiders concurrently, and I have seen that middlewares often have a `spider` parameter in their callbacks.
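For context, by "running multiple spiders concurrently" I mean something like the `CrawlerProcess` pattern from the docs; `SpiderOne`/`SpiderTwo` below are just throwaway placeholders:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderOne(scrapy.Spider):   # placeholder spider, just for illustration
    name = "one"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}

class SpiderTwo(scrapy.Spider):   # placeholder spider
    name = "two"
    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"url": response.url}

# Both spiders run in the same process, sharing one reactor.
process = CrawlerProcess()
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # blocks until both crawls have finished
```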
What I'd like to understand is:
- the relationship between `Crawler` and `Spider`. If I run one spider at a time, I'm assuming there's one of each. But if you run more spiders together, like in the example linked above, do you have one crawler for multiple spiders, or are they still 1:1?
- is there in any case only one instance of a middleware of a certain class, or do we get one per spider or per crawler?
- assuming there's only one instance, what are the `crawler.settings` passed in at middleware creation (for example, here)? The documentation says those take into account the settings overridden in the spider, but if there are multiple spiders with conflicting settings, what happens? (The skeleton right after this list shows the pattern I mean.)
I'm asking because I'd like to know how to handle spider-specific settings. Take again the DeltaFetch middleware as an example:
- enabling it seems to be a global matter, because `DELTAFETCH_ENABLED` is read from the `crawler.settings`;
- however, the sqlite db is opened in `spider_opened` and stored in a single instance attribute (i.e., not depending on the spider), so if you have more than one spider and the instance is shared, the old db is lost when the second spider is opened. And if you have only one instance of the middleware per spider, why bother passing the spider as a parameter? (The pattern I mean is sketched just below.)
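If I'm reading it right (this is a paraphrase from memory, not the actual DeltaFetch source), the shape of the problem is roughly:

```python
import os
import sqlite3

class SharedInstancePattern:
    """Paraphrase of how I read the current pattern -- not the real
    DeltaFetch code, just the shape of it."""

    def __init__(self, dir):
        self.dir = dir
        self.db = None  # single attribute, shared by whichever spiders use this instance

    def spider_opened(self, spider):
        # If the instance is shared and a second spider opens while the
        # first is still running, this silently replaces the first db.
        path = os.path.join(self.dir, f"{spider.name}.db")
        self.db = sqlite3.connect(path)
```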
Is that a correct way of handling it, or should you rather have a dict `spider_dbs` indexed by spider name?
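I.e., something along these lines (just my guess at what that would look like, not tested):

```python
import os
import sqlite3

class PerSpiderDbPattern:
    """Rough sketch of the per-spider variant I have in mind."""

    def __init__(self, dir):
        self.dir = dir
        self.spider_dbs = {}  # spider.name -> sqlite connection

    def spider_opened(self, spider):
        # One db handle per spider, so concurrent spiders don't clobber
        # each other's state.
        path = os.path.join(self.dir, f"{spider.name}.db")
        self.spider_dbs[spider.name] = sqlite3.connect(path)

    def spider_closed(self, spider):
        db = self.spider_dbs.pop(spider.name, None)
        if db is not None:
            db.close()
```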