
This question is essentially the same as Pass scraped URL's from one spider to another, but I'd like to double-check whether there really is no 'Scrapy-native' way to do this.

I'm scraping web pages which 99% of the time can be scraped successfully without rendering JavaScript. Sometimes, however, this fails and certain fields are not present. I'd like to write a Scrapy extension with an item_scraped method which checks whether all expected fields are populated and, if not, yields a SplashRequest to a different spider whose custom_settings include the Splash settings (cf. https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/).

Is there any Scrapy way to do this without using an external service (like Redis)?
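
For reference, here is a sketch of what "the Splash settings" amount to, following the scrapy-splash README (the 2015 blog post predates the scrapyjs-to-scrapy-splash rename, but the shape is the same; the `SPLASH_URL` value here is an assumption for a Splash instance running locally on the default port):

```python
# settings.py -- sketch of the Splash-related settings per the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```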


1 Answer


Enabling scrapy-splash only makes SplashRequest work; it does not affect regular scrapy.Request (as long as there is no 'splash' key in request.meta).

You can enable the Splash settings and still yield scrapy.Request - such requests will be processed without Splash, as in the sketch below.
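
A minimal sketch of that mixed setup (the URL, selectors, and field names are placeholders): with the Splash settings enabled project-wide, the plain request is fetched without Splash, and the spider retries the same page through Splash only when an expected field comes back empty.

```python
import scrapy
from scrapy_splash import SplashRequest

class HybridSpider(scrapy.Spider):
    name = 'hybrid'
    start_urls = ['https://example.com/products']  # placeholder

    def parse(self, response):
        # Plain response: the Splash middleware ignores requests
        # that have no 'splash' key in request.meta.
        item = {
            'title': response.css('h1::text').extract_first(),      # placeholder selectors
            'price': response.css('.price::text').extract_first(),
        }
        if all(item.values()):
            yield item
        else:
            # A field is missing -- retry the same URL rendered by Splash.
            # dont_filter=True since the plain request already visited it.
            yield SplashRequest(
                response.url,
                callback=self.parse_rendered,
                args={'wait': 0.5},
                dont_filter=True,
            )

    def parse_rendered(self, response):
        yield {
            'title': response.css('h1::text').extract_first(),
            'price': response.css('.price::text').extract_first(),
        }
```

This keeps everything in one spider and one process, so no external service like Redis is needed.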

  • In https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ there is mention of setting the `DUPEFILTER_CLASS` to `SplashAwareDupefilter`, which would also affect the regular `scrapy.Request`, no? – Kurt Peek Jul 20 '17 at 13:04
  • SplashAwareDupefilter is the same as the standard dupefilter for non-Splash requests, so if you're using the default dupefilter, SplashAwareDupefilter is a drop-in replacement. But if you want to use a non-default dupefilter, you need to create your own version that works for both Splash and non-Splash requests. – Mikhail Korobov Jul 20 '17 at 13:11
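
To illustrate the last comment, a hypothetical Splash-aware variant of a custom dupefilter can delegate to splash_request_fingerprint() from scrapy_splash, which reduces to the ordinary request fingerprint when a request carries no 'splash' meta (a sketch, not the library's shipped code):

```python
from scrapy.dupefilters import RFPDupeFilter
from scrapy_splash.dupefilter import splash_request_fingerprint

class MySplashAwareDupeFilter(RFPDupeFilter):  # hypothetical; base this on your own dupefilter
    def request_fingerprint(self, request):
        # Includes the Splash args in the fingerprint for Splash requests,
        # and falls back to the standard fingerprint otherwise.
        return splash_request_fingerprint(request)
```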