I have the following Scrapy parse method:
def parse(self, response):
    item_loader = ItemLoader(item=MyItem(), response=response)
    for url in response.xpath('//img/@src').extract():
        item_loader.add_value('image_urls', response.urljoin(url))
    yield item_loader.load_item()
    # if item['images_matched']:
    #     yield Request(links, callback=self.parse)
This sends the extracted image URLs to the ImagesPipeline. I need Scrapy to crawl additional links from that page if a certain condition is met, e.g. the checksum of the image contents matches a list of known hashes.
My problem is that I don't know how to access the Item once the ImagesPipeline has finished with it and populated all that data. In other words, item['images_matched'] does not get populated in the parse method, but in the pipeline. I need help either accessing the Item there or finding a different approach.
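For concreteness, the matching condition itself can be sketched with the stdlib, assuming the hash list holds MD5 hex digests (Scrapy's ImagesPipeline records an MD5 checksum for each downloaded file; the helper name and sample digest below are my own, not Scrapy API):

```python
import hashlib

# Hypothetical set of known MD5 hex digests to match against.
KNOWN_HASHES = {"5d41402abc4b2a76b9719d911017c592"}  # md5(b"hello")

def matches_known_hash(image_bytes, known_hashes=KNOWN_HASHES):
    """Return True if the MD5 of the raw image bytes is in the known set."""
    return hashlib.md5(image_bytes).hexdigest() in known_hashes
```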
EDIT: I've discovered that adding the following after the yield works:
yield Request(link, callback=self.parse, meta={'item': item_loader.load_item()})
However, this seems like bad practice to me, since the item dict can be quite large at times. Passing all of it just to check one attribute feels wrong. Is there a better way?
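One way to shrink that payload, if the meta approach is kept: copy only the fields the follow-up callback actually reads instead of the whole item. This is a sketch under that assumption; minimal_meta is my own helper, not a Scrapy API, and it still requires the flag to be present before the request is yielded:

```python
def minimal_meta(item, keys=("images_matched",)):
    """Build a small meta dict holding only the fields the follow-up
    request's callback will read, instead of the entire item."""
    return {k: item[k] for k in keys if k in item}

# Usage in the spider would then look something like:
# yield Request(link, callback=self.parse, meta=minimal_meta(dict(item)))
```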