I am scraping a main page that has a list of items. Within my pipeline I connect to a database to store the items. My next task is to go to each individual item page and scrape comments. I need to connect to the database again to see if I've already scraped the comments.

Is it more efficient for me to connect to the database in the pipeline or in the crawl script?
Is there a way to return from the pipeline and tell the crawler to crawl the comments?
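For reference, here is a minimal sketch of the setup described above: an item pipeline that opens one database connection per crawl and stores each item. The use of `sqlite3`, the `items.db` filename, and the table schema are illustrative assumptions; any database client fits the same pipeline hooks.

```python
import sqlite3


class ItemStoragePipeline:
    def open_spider(self, spider):
        # Open one connection per spider run and reuse it for every item.
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(url TEXT PRIMARY KEY, title TEXT, comments_scraped INTEGER DEFAULT 0)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Skip items whose URL is already stored from a previous crawl.
        self.conn.execute(
            "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item
```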

  • Why don't you design it the way that you scrape the item, pass it in `meta` to another `Request` for the individual item page, scrape the comments, update the item from `meta`, and yield it in the end (see the sketch after these comments)? – Tomáš Linhart Feb 11 '18 at 15:58
  • That was my initial thought. I wasn't sure if that was the best way to do it. Thanks for the comment. – Learning C Mar 02 '18 at 22:10
  • It doesn't have to be the best way in every case, but it's common enough to be part of Scrapy [FAQ](https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments). – Tomáš Linhart Mar 03 '18 at 05:50
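A minimal sketch of the `meta`-passing pattern Tomáš Linhart describes, with a placeholder URL and assumed CSS selectors: the item is built in `parse`, carried to the detail page in `meta`, completed with the comments, and only then yielded, so the pipeline sees each item exactly once, fully populated.

```python
import scrapy


class ItemsSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://example.com/items"]  # placeholder URL

    def parse(self, response):
        # The selectors here are assumptions about the page structure.
        for row in response.css("div.item"):
            item = {
                "url": response.urljoin(row.css("a::attr(href)").get()),
                "title": row.css("a::text").get(),
            }
            # Pass the partially filled item along to the detail page.
            yield scrapy.Request(
                item["url"],
                callback=self.parse_comments,
                meta={"item": item},
            )

    def parse_comments(self, response):
        # Retrieve the item, attach the comments, and yield it once complete.
        item = response.meta["item"]
        item["comments"] = response.css("div.comment p::text").getall()
        yield item
```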

0 Answers