
I have a scrapy project where I need to store some scraped items in Redis.

I was thinking about writing my own pipeline class, but then I found scrapy-redis and decided to try it.

My question is: what should I do if the scraped item is invalid?

By invalid, I mean that as far as my application is concerned, this item should be discarded and not processed.

I know that if I write my own pipeline class, I can raise a DropItem exception, but what can I do if I use RedisPipeline?

I can think of two possible solutions:

  1. Subclass RedisPipeline, override process_item, drop an invalid item, and delegate the processing of a valid item to RedisPipeline.process_item. Then use this subclassed pipeline in my spiders.
  2. Define another pipeline class responsible for dropping invalid items, and give this pipeline a higher priority (a lower order number in ITEM_PIPELINES) so that it runs before RedisPipeline.

I was thinking about something along these lines:

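from scrapy.exceptions import DropItem


class DropItemPipeline(object):
    """Discard invalid items before they reach any later pipeline."""

    def process_item(self, item, spider):
        # "is_valid" is a field my spider sets on every item.
        if not item["is_valid"]:
            raise DropItem("Invalid item: %r" % item)
        return item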

See also: How can I use different pipelines for different spiders in a single Scrapy project

jackdbd

1 Answer


You can enable multiple pipelines in your project, so you can combine the scrapy-redis RedisPipeline with one you write yourself for dropping items:

ITEM_PIPELINES = {
    'my.own.Pipeline': 299,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

In your own pipeline, just drop the invalid items. Make sure it has a lower order value (299 in my example) than RedisPipeline: Scrapy runs pipelines in ascending order of these values, so once an item is dropped it never reaches the pipelines that follow.
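To make the ordering concrete, here is a rough, Scrapy-free model of what happens to an item. It is only a sketch: DropItem, FakeRedisPipeline and run_pipelines are hypothetical stand-ins written for this illustration (the real classes are scrapy.exceptions.DropItem and scrapy_redis.pipelines.RedisPipeline), and the loop is a simplification of how Scrapy chains pipelines, not its actual implementation.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DropInvalidPipeline:
    """Plays the role of 'my.own.Pipeline': rejects items not marked valid."""

    def process_item(self, item, spider):
        if not item.get("is_valid"):
            raise DropItem("invalid item")
        return item


class FakeRedisPipeline:
    """Hypothetical stand-in for RedisPipeline: records items instead of pushing to Redis."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item


def run_pipelines(item, item_pipelines, spider=None):
    """Simplified model: call pipelines in ascending order value, stop on DropItem."""
    for pipeline in sorted(item_pipelines, key=item_pipelines.get):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # a dropped item never reaches the later pipelines
    return item


redis_stub = FakeRedisPipeline()
# Same shape as ITEM_PIPELINES: the pipeline with the lower number runs first.
item_pipelines = {redis_stub: 300, DropInvalidPipeline(): 299}

run_pipelines({"id": 1, "is_valid": True}, item_pipelines)
run_pipelines({"id": 2, "is_valid": False}, item_pipelines)
print([stored["id"] for stored in redis_stub.stored])  # prints [1]
```

Only the valid item ends up in the stub. If the two numbers were swapped, the Redis stand-in would see both items before the validity check ran.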

eLRuLL