I have a Scrapy project where I need to store some scraped items in Redis.
I was thinking about writing my own pipeline class, but then I found scrapy-redis and decided to try it.
My question is: what should I do if the scraped item is invalid?
By invalid, I mean that as far as my application is concerned, this item should be discarded and not processed.
I know that if I write my own pipeline class, I can raise a DropItem exception, but what can I do if I use RedisPipeline?
I can think of two possible solutions:
- Subclass RedisPipeline, override process_item, drop an invalid item, and delegate the processing of a valid item to RedisPipeline.process_item. Then use this subclassed pipeline in my spiders.
- Define another pipeline class responsible for dropping invalid items, and give this pipeline a higher priority.
For the second option, I was thinking about something along these lines:
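
    from scrapy.exceptions import DropItem


    class DropItemPipeline(object):
        def process_item(self, item, spider):
            # Discard items that my application flags as invalid.
            if not item["is_valid"]:
                raise DropItem("invalid item")
            return item
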
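To give the dropping pipeline the higher priority, I assume the settings would look roughly like this. Scrapy runs pipelines in ascending order of the integer in ITEM_PIPELINES, so the lower number goes first; the module path myproject.pipelines and the values 100 and 300 are placeholders I picked.

```python
# settings.py (sketch; "myproject.pipelines" is a placeholder module path)
ITEM_PIPELINES = {
    # Lower value runs first, so invalid items are dropped here...
    "myproject.pipelines.DropItemPipeline": 100,
    # ...and never reach the pipeline that writes to Redis.
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
```

Once an item is dropped with DropItem, Scrapy does not pass it to the later pipelines, so RedisPipeline should only ever see valid items.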
See also: How can I use different pipelines for different spiders in a single Scrapy project