
This question is essentially the same as Pass scraped URL's from one spider to another, but I'd like to double-check whether there really is no 'Scrapy-native' way to do this.

I'm scraping web pages which 99% of the time can be scraped successfully without rendering JavaScript. Sometimes, however, this fails and certain fields are not present. I'd like to write a Scrapy extension with an item_scraped method which checks whether all expected fields are populated and, if not, yields a SplashRequest to a different spider whose custom_settings include the Splash settings (cf. https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/).

Is there any Scrapy way to do this without using an external service (like Redis)?
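
For reference, here is a sketch of what "the Splash settings" amount to, following the scrapy-splash README (the 2015 blog post predates the scrapyjs-to-scrapy-splash rename, but the shape is the same; the `SPLASH_URL` value here is an assumption for a Splash instance running locally on the default port):

```python
# settings.py -- sketch of the Splash-related settings per the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```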


1 Answer


Enabling scrapy-splash only makes SplashRequest work; it does not affect regular scrapy.Request (as long as there is no 'splash' key in request.meta).

You can enable the Splash settings and still yield scrapy.Request - such requests will be processed without Splash, as in the sketch below.
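
A minimal sketch of that mixed setup (the URL, selectors, and field names are placeholders): with the Splash settings enabled project-wide, the plain request is fetched without Splash, and the spider retries the same page through Splash only when an expected field comes back empty.

```python
import scrapy
from scrapy_splash import SplashRequest

class HybridSpider(scrapy.Spider):
    name = 'hybrid'
    start_urls = ['https://example.com/products']  # placeholder

    def parse(self, response):
        # Plain response: the Splash middleware ignores requests
        # that have no 'splash' key in request.meta.
        item = {
            'title': response.css('h1::text').extract_first(),      # placeholder selectors
            'price': response.css('.price::text').extract_first(),
        }
        if all(item.values()):
            yield item
        else:
            # A field is missing -- retry the same URL rendered by Splash.
            # dont_filter=True since the plain request already visited it.
            yield SplashRequest(
                response.url,
                callback=self.parse_rendered,
                args={'wait': 0.5},
                dont_filter=True,
            )

    def parse_rendered(self, response):
        yield {
            'title': response.css('h1::text').extract_first(),
            'price': response.css('.price::text').extract_first(),
        }
```

This keeps everything in one spider and one process, so no external service like Redis is needed.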

  • In https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ there is mention of setting the `DUPEFILTER_CLASS` to `SplashAwareDupefilter`, which would also affect the regular `scrapy.Request`, no? – Kurt Peek Jul 20 '17 at 13:04
  • SplashAwareDupefilter is the same as the standard dupefilter for non-Splash requests, so if you're using the default dupefilter, SplashAwareDupefilter is a drop-in replacement. But if you want to use a non-default dupefilter, you need to create your own version that works for both Splash and non-Splash requests. – Mikhail Korobov Jul 20 '17 at 13:11
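
To illustrate the last comment, a hypothetical Splash-aware variant of a custom dupefilter can delegate to splash_request_fingerprint() from scrapy_splash, which reduces to the ordinary request fingerprint when a request carries no 'splash' meta (a sketch, not the library's shipped code):

```python
from scrapy.dupefilters import RFPDupeFilter
from scrapy_splash.dupefilter import splash_request_fingerprint

class MySplashAwareDupeFilter(RFPDupeFilter):  # hypothetical; base this on your own dupefilter
    def request_fingerprint(self, request):
        # Includes the Splash args in the fingerprint for Splash requests,
        # and falls back to the standard fingerprint otherwise.
        return splash_request_fingerprint(request)
```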