I am scraping a main page that has a list of items. Within my pipeline I connect to a database to store the items. My next task is to go to each individual item page and scrape comments. I need to connect to the database again to see if I've already scraped the comments.

Is it more efficient for me to connect to the database in the pipeline or in the crawl script?
Is there a way to return from the pipeline and tell the crawler to crawl the comments?
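For reference, here is a minimal sketch of the setup described above: an item pipeline that opens one database connection per crawl and stores each item. The use of `sqlite3`, the `items.db` filename, and the table schema are illustrative assumptions; any database client fits the same pipeline hooks.

```python
import sqlite3


class ItemStoragePipeline:
    def open_spider(self, spider):
        # Open one connection per spider run and reuse it for every item.
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(url TEXT PRIMARY KEY, title TEXT, comments_scraped INTEGER DEFAULT 0)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Skip items whose URL is already stored from a previous crawl.
        self.conn.execute(
            "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item
```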

  • Why don't you design it the way that you scrape the item, pass it in `meta` to another `Request` for the individual item page, scrape the comments, update the item from `meta`, and yield it in the end (see the sketch after these comments)? – Tomáš Linhart Feb 11 '18 at 15:58
  • That was my initial thought. I wasn't sure if that was the best way to do it. Thanks for the comment. – Learning C Mar 02 '18 at 22:10
  • It doesn't have to be the best way in every case, but it's common enough to be part of Scrapy [FAQ](https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments). – Tomáš Linhart Mar 03 '18 at 05:50
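A minimal sketch of the `meta`-passing pattern Tomáš Linhart describes, with a placeholder URL and assumed CSS selectors: the item is built in `parse`, carried to the detail page in `meta`, completed with the comments, and only then yielded, so the pipeline sees each item exactly once, fully populated.

```python
import scrapy


class ItemsSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://example.com/items"]  # placeholder URL

    def parse(self, response):
        # The selectors here are assumptions about the page structure.
        for row in response.css("div.item"):
            item = {
                "url": response.urljoin(row.css("a::attr(href)").get()),
                "title": row.css("a::text").get(),
            }
            # Pass the partially filled item along to the detail page.
            yield scrapy.Request(
                item["url"],
                callback=self.parse_comments,
                meta={"item": item},
            )

    def parse_comments(self, response):
        # Retrieve the item, attach the comments, and yield it once complete.
        item = response.meta["item"]
        item["comments"] = response.css("div.comment p::text").getall()
        yield item
```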

0 Answers