Sorry to disturb you. This turned out to be a bad question; what really confused me is how the ItemPipeline works in Scrapy. I'll close it and start a new question.
Where should I bind the db/redis connection in Scrapy: on the Spider or on the Pipeline?
In the Scrapy documentation, the MongoDB connection is bound in the Pipeline. But it could also be bound to the Spider (which is what the scrapy-redis extension does). The latter solution has the benefit that the spider is accessible in more places than just the pipeline, such as middlewares.
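The docs-style pattern (the pipeline owns the connection, opened and closed alongside the spider) can be sketched roughly as below. `open_spider`, `close_spider`, and `process_item` are the real Scrapy pipeline hooks; `FakeRedis` is a stand-in class of my own so the sketch runs without a real server:

```python
class FakeRedis:
    """Stand-in for a real client (e.g. redis.Redis), so the sketch is self-contained."""
    def __init__(self):
        self.store = {}
        self.closed = False

    def set(self, key, value):
        self.store[key] = value

    def close(self):
        self.closed = True


class RedisPipeline:
    """Docs-style: the connection lives and dies with the spider."""

    def open_spider(self, spider):
        # real code: self.client = redis.Redis(host=..., port=...)
        self.client = FakeRedis()

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.client.set(item["id"], item)
        return item


# Minimal demonstration of the hook sequence, without running Scrapy itself:
p = RedisPipeline()
p.open_spider(spider=None)
p.process_item({"id": "a", "title": "x"}, spider=None)
p.close_spider(spider=None)
print(p.client.store)  # {'a': {'id': 'a', 'title': 'x'}}
```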
So, which is the better way to do it?
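For comparison, binding the connection on the spider (roughly the scrapy-redis style) might look like the sketch below. `from_crawler` is the real Scrapy hook for constructing a spider; `FakeRedis` and `MySpider` are hypothetical stand-ins, and a real spider would subclass `scrapy.Spider` and call the parent `from_crawler`:

```python
class FakeRedis:
    """Stand-in for a real redis client."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class MySpider:  # real code: class MySpider(scrapy.Spider)
    name = "my_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls()
        # real code: spider.client = redis.Redis(...), perhaps configured
        # from crawler.settings
        spider.client = FakeRedis()
        return spider

    def closed(self, reason):
        # Scrapy calls this when the spider finishes
        self.client.close()


spider = MySpider.from_crawler(crawler=None)
# Any component that receives the spider (pipelines, middlewares,
# signal handlers) can now reach spider.client.
spider.closed("finished")
print(spider.client.closed)  # True
```

The design trade-off is visibility: the connection on the spider is reachable from middlewares too, at the cost of coupling the spider to storage concerns.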
I'm confused by the claim that pipelines are run in parallel (this is what the docs say). Does it mean there are multiple instances of MyCustomPipeline?
Besides, is a connection pool for redis/the db preferred?
I just lack the field experience to make this decision. I need your help. Thanks in advance.
As the docs say, the ItemPipeline is run in parallel. How? Are there duplicate instances of the ItemPipeline running in threads? (I noticed FilesPipeline uses a deferred thread to save files to S3.) Or is there only one instance of each pipeline, running in the main event loop? If it's the latter case, a connection pool doesn't seem to help, because when you use a redis connection it blocks, so only one request can be in flight at a time.
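To illustrate the one-instance-plus-thread-offloading idea: Scrapy creates a single instance of each enabled pipeline, and a blocking call inside process_item would stall the event loop, which is why FilesPipeline pushes I/O into a thread via Twisted's deferToThread. The same idea can be sketched with a stdlib thread pool; `blocking_save` is a hypothetical stand-in for a blocking redis/db call, and a real pipeline would return a Twisted Deferred instead of a Future:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_save(item):
    """Stand-in for a blocking redis/db write."""
    time.sleep(0.01)
    return item["id"]

# Several blocking saves overlap in worker threads instead of
# serializing on the single event-loop thread:
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(blocking_save, {"id": i}) for i in range(4)]
    results = sorted(f.result() for f in futures)

print(results)  # [0, 1, 2, 3]
```

Under this model a connection pool does help: each worker thread can check out its own connection, so concurrent saves don't queue behind one blocked socket.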