How to raise scrapy.exceptions.DropItem in scrapy_redis.pipelines.RedisPipeline

Question

I have a scrapy project where I need to store some scraped items in Redis.

I was thinking about writing my own pipeline class, but then I found scrapy-redis and decided to try it.

My question is: what should I do if the scraped item is invalid?

By invalid, I mean that as far as my application is concerned, this item should be discarded and not processed.

I know that if I write my own pipeline class, I can raise a DropItem exception, but what can I do if I use RedisPipeline?

I can think of two possible solutions:

Subclass RedisPipeline, override process_item, drop an invalid item, and delegate the processing of a valid item to RedisPipeline.process_item. Then use this subclassed pipeline in my spiders.
Define another pipeline class responsible for dropping invalid items, and give this pipeline a higher priority.

I was thinking about something along these lines:

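from scrapy.exceptions import DropItem


class DropItemPipeline(object):

    def process_item(self, item, spider):
        # Discard items that the application has flagged as invalid.
        if not item["is_valid"]:
            raise DropItem("Invalid item: %r" % (item,))
        # Valid items continue to the next pipeline (e.g. RedisPipeline).
        return item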

See also: How can I use different pipelines for different spiders in a single Scrapy project

You can have multiple pipelines. – eLRuLL, Aug 09 '18 at 20:04

score 1 · Accepted Answer · answered Aug 09 '18 at 20:12

You can set up multiple pipelines for your project, so you can use the scrapy-redis pipeline together with one you write yourself for dropping items:

ITEM_PIPELINES = {
    'my.own.Pipeline': 299,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

In your own pipeline, just drop the invalid items. Make sure your pipeline has a lower number (299 in my example) than RedisPipeline: Scrapy runs pipelines in order from lower to higher values, so a dropped item never reaches the pipelines that follow.
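
To see why the ordering matters, here is a small simulation of the behaviour described above. It is only a sketch and not Scrapy's actual implementation: `DropItem` is a local stand-in for `scrapy.exceptions.DropItem`, and `DropInvalidPipeline`, `FakeRedisPipeline` and `run_pipelines` are hypothetical names made up for the illustration, so it runs without Scrapy or Redis installed.

```python
# Local stand-in for scrapy.exceptions.DropItem (assumption: used here only
# so the sketch runs without Scrapy installed).
class DropItem(Exception):
    pass


class DropInvalidPipeline:
    """Hypothetical validation stage, playing the role of 'my.own.Pipeline'."""

    def process_item(self, item, spider):
        if not item.get("is_valid"):
            raise DropItem("Invalid item: %r" % (item,))
        return item


class FakeRedisPipeline:
    """Stand-in for RedisPipeline: records items instead of pushing to Redis."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass an item through the stages in ascending order of their number."""
    for _, stage in sorted(pipelines, key=lambda pair: pair[0]):
        try:
            item = stage.process_item(item, spider)
        except DropItem:
            return None  # the remaining stages never see the item
    return item


redis_stage = FakeRedisPipeline()
# Same numbers as in the settings above: 299 runs before 300.
pipelines = [(300, redis_stage), (299, DropInvalidPipeline())]

run_pipelines({"name": "a", "is_valid": True}, pipelines)
run_pipelines({"name": "b", "is_valid": False}, pipelines)
print(redis_stage.stored)  # only the valid item was recorded
```

If the numbers were swapped, the stand-in Redis stage would run first and record the invalid item before the validation stage had a chance to drop it.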
