Scrapy: Use Feed Exports after custom Item Pipeline without custom Feed Exporter class?

Question

My Spider looks like this:

class ExampleSpider(scrapy.Spider):
    name = 'example'

    custom_settings = {
        'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,},
        'FEEDS': {
            'feeds/example/tags.csv': {
                'format': 'csv',
                'fields': ["tag_id", "url", "title"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleTagItem],
                'overwrite': False
            },
            'feeds/example/galleries.csv': {
                'format': 'csv',
                'fields': ["id", "url", "tag_ids"],
                'item_export_kwargs': {
                    'include_headers_line': False,
                },
                'item_classes': [ExampleGalleryItem],
                'overwrite': False,
            }
        }
    }

This is the img_clear.pipelines.DuplicatesPipeline:

class DuplicatesPipeline():
    def open_spider(self, spider):
        if spider.name == "example":
            with open("feeds/example/galleries.csv", "r") as rf:
                csv = rf.readlines()
            self.ids_seen = set([str(line.split(",")[0]) for line in csv])
            
            with open("feeds/example/tags.csv", "r") as rf:
                tags_csv = rf.readlines()
            self.tag_ids_seen = set([str(line.split(",")[0]) for line in tags_csv])

    def process_item(self, item, spider):
        if isinstance(item, ExampleTagItem):
            self.process_example_tag_item(item, spider)    
        elif isinstance(item, ExampleGalleryItem):
            self.process_example_gallery_item(item, spider)

    def process_example_tag_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
            return item

    def process_example_gallery_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item

With the item pipeline activated, it drops some items (logging: [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {'tag_id': '4',...) and returns others (logging: [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/10232335/>), but nothing is written to the files. Somehow the returned items don't seem to reach the feed exports extension. What am I missing?

When I comment out the 'ITEM_PIPELINES': {'img_clear.pipelines.DuplicatesPipeline': 100,}, line in custom_settings, items are saved to the right CSV files.
Using scrapy crawl example -o test.csv also creates an empty CSV when the pipeline is activated, so the issue seems to be with the pipeline.
Printing the items right before they should be returned shows the correct item information.
The pipeline is derived from the Scrapy docs.

score -1 · Answer 1 · answered Jan 24 '23 at 15:45

Thanks for the response! I'm not sure if this would actually have fixed it, since the feed was working perfectly with relative paths when the pipeline was deactivated. I might test that anyway some time.

However, I found another mistake in my code, and fixing it solved the problem without changing the paths: the docs state that the process_item function must return an item object, return a Twisted Deferred, or raise a DropItem exception. My code was derived from here, but I missed the return statements in the lines calling the process_..._item functions.
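The effect of the missing return can be reproduced without Scrapy. In the minimal sketch below, the class and method names are made up for illustration. The point is only that a Python method which calls a helper without returning its result hands back None instead of the item.

```python
class WithoutReturn:
    """Mirrors the original bug: the helper's result is discarded."""

    def process_item(self, item, spider):
        self._check(item)  # no `return`, so process_item returns None

    def _check(self, item):
        return item


class WithReturn(WithoutReturn):
    """Same helper, but the result is passed back to the caller."""

    def process_item(self, item, spider):
        return self._check(item)


item = {"tag_id": "4"}
print(WithoutReturn().process_item(item, spider=None))  # None
print(WithReturn().process_item(item, spider=None))     # {'tag_id': '4'}
```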

Tbh, I discovered the solution by accident while trying to replicate my issue in a less complex spider. I wrote up something like this and it worked:

def process_item(self, item, spider):
    if isinstance(item, ExampleTagItem):
        adapter = ItemAdapter(item)
        if adapter['tag_id'] in self.tag_ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.tag_ids_seen.add(adapter['tag_id'])
        return item
    elif isinstance(item, ExampleGalleryItem):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
        return item

Since I'm very new to coding: any suggestions on how to reduce the repetition in this code? I could use "id" in both Item objects but would still need to differentiate between the two sets, so I have no idea how to do this...
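One possible way to cut the repetition is to keep a single mapping from item class to its ID field and its set of seen values, then loop over that mapping. This is only a sketch under assumptions: the two item classes are written as plain dataclasses with the field names from the feed settings, the fallback DropItem exists only so the snippet runs without Scrapy installed, and the sets start empty here instead of being filled from the CSV files in open_spider. With scrapy.Item subclasses, ItemAdapter(item)[field] would take the place of getattr.

```python
from dataclasses import dataclass

try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


@dataclass
class ExampleTagItem:  # hypothetical definition, fields taken from the feed config
    tag_id: str
    url: str = ""
    title: str = ""


@dataclass
class ExampleGalleryItem:  # hypothetical definition, fields taken from the feed config
    id: str
    url: str = ""
    tag_ids: str = ""


class DuplicatesPipeline:
    def __init__(self):
        # One entry per item class: (name of the ID field, IDs seen so far).
        # In the real pipeline, open_spider would pre-fill these sets.
        self.seen = {
            ExampleTagItem: ("tag_id", set()),
            ExampleGalleryItem: ("id", set()),
        }

    def process_item(self, item, spider):
        for item_class, (field, seen_ids) in self.seen.items():
            if isinstance(item, item_class):
                value = getattr(item, field)
                if value in seen_ids:
                    raise DropItem(f"Duplicate item found: {item!r}")
                seen_ids.add(value)
                break
        # Always hand the item on so later stages receive it.
        return item
```

Adding a third item type would then only need one more entry in the mapping, and items of any other class pass through unchanged.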
