Sorry to disturb you. This turned out to be a bad question; what really confused me is how the ItemPipeline works in Scrapy. I'll close it and start a new question.
Where should I bind the db/redis connection in Scrapy: on the Spider or on the Pipeline?
In the Scrapy documentation, the MongoDB connection is bound in the Pipeline. But it could also be bound to the Spider (which is what the scrapy-redis extension does). The latter solution has the benefit that the spider is accessible in more places than just the pipeline, such as middlewares.
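The docs-style pattern (the pipeline owns the connection, opened and closed alongside the spider) can be sketched roughly as below. `open_spider`, `close_spider`, and `process_item` are the real Scrapy pipeline hooks; `FakeRedis` is a stand-in class of my own so the sketch runs without a real server:

```python
class FakeRedis:
    """Stand-in for a real client (e.g. redis.Redis), so the sketch is self-contained."""
    def __init__(self):
        self.store = {}
        self.closed = False

    def set(self, key, value):
        self.store[key] = value

    def close(self):
        self.closed = True


class RedisPipeline:
    """Docs-style: the connection lives and dies with the spider."""

    def open_spider(self, spider):
        # real code: self.client = redis.Redis(host=..., port=...)
        self.client = FakeRedis()

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.client.set(item["id"], item)
        return item


# Minimal demonstration of the hook sequence, without running Scrapy itself:
p = RedisPipeline()
p.open_spider(spider=None)
p.process_item({"id": "a", "title": "x"}, spider=None)
p.close_spider(spider=None)
print(p.client.store)  # {'a': {'id': 'a', 'title': 'x'}}
```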
So, which is the better way to do it?
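For comparison, binding the connection on the spider (roughly the scrapy-redis style) might look like the sketch below. `from_crawler` is the real Scrapy hook for constructing a spider; `FakeRedis` and `MySpider` are hypothetical stand-ins, and a real spider would subclass `scrapy.Spider` and call the parent `from_crawler`:

```python
class FakeRedis:
    """Stand-in for a real redis client."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class MySpider:  # real code: class MySpider(scrapy.Spider)
    name = "my_spider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls()
        # real code: spider.client = redis.Redis(...), perhaps configured
        # from crawler.settings
        spider.client = FakeRedis()
        return spider

    def closed(self, reason):
        # Scrapy calls this when the spider finishes
        self.client.close()


spider = MySpider.from_crawler(crawler=None)
# Any component that receives the spider (pipelines, middlewares,
# signal handlers) can now reach spider.client.
spider.closed("finished")
print(spider.client.closed)  # True
```

The design trade-off is visibility: the connection on the spider is reachable from middlewares too, at the cost of coupling the spider to storage concerns.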
I'm confused by the claim that pipelines are run in parallel (this is what the docs say). Does it mean there are multiple instances of MyCustomPipeline?
Besides, is a connection pool for redis/the db preferred?
I just lack the field experience to make this decision. I need your help. Thanks in advance.
As the docs say, the ItemPipeline is run in parallel. How? Are there duplicate instances of the ItemPipeline running in threads? (I noticed FilesPipeline uses a deferred thread to save files to S3.) Or is there only one instance of each pipeline, running in the main event loop? If it's the latter case, a connection pool doesn't seem to help, because when you use a redis connection it blocks, so only one request can be in flight at a time.
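To illustrate the one-instance-plus-thread-offloading idea: Scrapy creates a single instance of each enabled pipeline, and a blocking call inside process_item would stall the event loop, which is why FilesPipeline pushes I/O into a thread via Twisted's deferToThread. The same idea can be sketched with a stdlib thread pool; `blocking_save` is a hypothetical stand-in for a blocking redis/db call, and a real pipeline would return a Twisted Deferred instead of a Future:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_save(item):
    """Stand-in for a blocking redis/db write."""
    time.sleep(0.01)
    return item["id"]

# Several blocking saves overlap in worker threads instead of
# serializing on the single event-loop thread:
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(blocking_save, {"id": i}) for i in range(4)]
    results = sorted(f.result() for f in futures)

print(results)  # [0, 1, 2, 3]
```

Under this model a connection pool does help: each worker thread can check out its own connection, so concurrent saves don't queue behind one blocked socket.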