I have the following Scrapy parse method:
def parse(self, response):
    item_loader = ItemLoader(item=MyItem(), response=response)
    for url in response.xpath('//img/@src').extract():
        item_loader.add_value('image_urls', response.urljoin(url))
    yield item_loader.load_item()
    # if item['images_matched']:
    #     yield Request(links, callback=self.parse)
This sends the extracted image URLs to the ImagesPipeline. I need Scrapy to crawl additional links from that page if a certain condition is met, e.g. the checksum of the image contents matches a list of known hashes.
My problem is that I don't know how to access the Item once the ImagesPipeline has finished with it and populated all that data. In other words, item['images_matched'] does not get populated in the parse method, but in the pipeline. I need help either accessing the Item there or finding a different approach.
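For concreteness, the matching condition itself can be sketched with the stdlib, assuming the hash list holds MD5 hex digests (Scrapy's ImagesPipeline records an MD5 checksum for each downloaded file; the helper name and sample digest below are my own, not Scrapy API):

```python
import hashlib

# Hypothetical set of known MD5 hex digests to match against.
KNOWN_HASHES = {"5d41402abc4b2a76b9719d911017c592"}  # md5(b"hello")

def matches_known_hash(image_bytes, known_hashes=KNOWN_HASHES):
    """Return True if the MD5 of the raw image bytes is in the known set."""
    return hashlib.md5(image_bytes).hexdigest() in known_hashes
```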
EDIT: I've discovered that adding the following after the yield works:
yield Request(link, callback=self.parse, meta={'item': item_loader.load_item()})
However, this seems like bad practice to me, since the item dict can be quite large at times. Passing all of it just to check one attribute feels wrong. Is there a better way?
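One way to shrink that payload, if the meta approach is kept: copy only the fields the follow-up callback actually reads instead of the whole item. This is a sketch under that assumption; minimal_meta is my own helper, not a Scrapy API, and it still requires the flag to be present before the request is yielded:

```python
def minimal_meta(item, keys=("images_matched",)):
    """Build a small meta dict holding only the fields the follow-up
    request's callback will read, instead of the entire item."""
    return {k: item[k] for k in keys if k in item}

# Usage in the spider would then look something like:
# yield Request(link, callback=self.parse, meta=minimal_meta(dict(item)))
```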