Scrapy, Python: Multiple Item Classes in one pipeline?

Question

I have a Spider that scrapes data which cannot be saved in one item class.

For illustration, I have one Profile Item, and each Profile Item might have an unknown number of Comments. That is why I want to implement Profile Item and Comment Item. I know I can pass them to my pipeline simply by using yield.

However, I do not know how a pipeline with a single process_item method can handle two different item classes.
Or is it possible to use different process_item methods?
Or do I have to use several pipelines?
Or is it possible to write an iterator to a Scrapy Item Field?

comments_list = []
comments = response.xpath(somexpath)
for x in comments.extract():
    comments_list.append(x)
ScrapyItem['comments'] = comments_list

score 25 · Answer 1 · edited Apr 19 '23 at 03:26

By default every item goes through every pipeline.

For instance, if you yield a ProfileItem and a CommentItem, they'll both go through all pipelines. If you have a pipeline set up to track item types, then your process_item method could look like:

def process_item(self, item, spider):
    self.stats.inc_value('typecount/%s' % type(item).__name__)
    return item

When a ProfileItem comes through, 'typecount/ProfileItem' is incremented. When a CommentItem comes through, 'typecount/CommentItem' is incremented.
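For context, here is a minimal standalone sketch of that counting idea. It makes a few assumptions: the two item classes are hypothetical dataclasses, and a `collections.Counter` stands in for Scrapy's stats collector so the snippet runs without a crawler. In a real project, `self.stats` would typically be taken from `crawler.stats` inside a `from_crawler` class method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProfileItem:  # hypothetical item class
    name: str


@dataclass
class CommentItem:  # hypothetical item class
    text: str


class TypeCountPipeline:
    def __init__(self):
        # Stand-in for Scrapy's stats collector (assumption for this sketch).
        self.counts = Counter()

    def process_item(self, item, spider):
        # One counter per item class name, e.g. 'typecount/ProfileItem'.
        self.counts['typecount/%s' % type(item).__name__] += 1
        return item  # hand the item on to the next pipeline


pipeline = TypeCountPipeline()
for scraped in (ProfileItem('alice'), CommentItem('hi'), CommentItem('bye')):
    pipeline.process_item(scraped, spider=None)
print(dict(pipeline.counts))
```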

You can also have a pipeline handle only one item type, if the handling for that type is unique, by checking the item type before proceeding:

def process_item(self, item, spider):
    if not isinstance(item, ProfileItem):
        return item
    # Handle your ProfileItem here, then return it.
    return item

If the two process_item methods above were set up in different pipelines, each item would go through both of them, being tracked by the first and processed (or ignored) by the second.

Additionally, you could have one pipeline set up to handle all 'related' items:

def process_item(self, item, spider):
    if isinstance(item, ProfileItem):
        return self.handle_profile(item, spider)
    if isinstance(item, CommentItem):
        return self.handle_comment(item, spider)
    return item  # pass any other item type through unchanged

def handle_profile(self, item, spider):
    # Handle profile here, then return the item
    return item

def handle_comment(self, item, spider):
    # Handle comment here, then return the item
    return item

Or, you could make it even more complex and develop a type delegation system that loads classes and calls default handler methods, similar to how Scrapy handles middleware/pipelines. It's really up to you how complex you need it, and what you want to do.
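One simple way to approach such a delegation system is a dictionary that maps item classes to handler methods. The sketch below is only an illustration under stated assumptions: the item classes are hypothetical dataclasses, and the `handlers` registry and `seen` list are names invented for this example, not part of Scrapy.

```python
from dataclasses import dataclass


@dataclass
class ProfileItem:  # hypothetical item class
    name: str


@dataclass
class CommentItem:  # hypothetical item class
    text: str


class DelegatingPipeline:
    def __init__(self):
        # Registry mapping each item class to the method that handles it.
        self.handlers = {
            ProfileItem: self.handle_profile,
            CommentItem: self.handle_comment,
        }
        self.seen = []  # records what was handled, for demonstration only

    def process_item(self, item, spider):
        handler = self.handlers.get(type(item))
        if handler is None:
            return item  # unknown item types pass through untouched
        return handler(item, spider)

    def handle_profile(self, item, spider):
        self.seen.append(('profile', item.name))
        return item

    def handle_comment(self, item, spider):
        self.seen.append(('comment', item.text))
        return item


delegating = DelegatingPipeline()
delegating.process_item(ProfileItem('alice'), spider=None)
delegating.process_item(CommentItem('nice profile'), spider=None)
passthrough = delegating.process_item({'other': 1}, spider=None)
```

Adding a new item type then only requires one more registry entry and one more handler method, which keeps process_item itself unchanged.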

gerosalesc · Answer 2 · 2015-09-23T18:33:36.307

Defining multiple Items is tricky when you export your data if the items are related (Profile 1 -- N Comments, for instance) and you have to export them together, because each item is processed at a different time by the pipelines. An alternative approach for this scenario is to define a custom Scrapy Field, for example:

import scrapy

class ProfileField(scrapy.Field):
    # your business here
    pass

class CommentItem(scrapy.Item):
    profile = ProfileField()

But if you MUST have two items, it is strongly recommended to use a different pipeline for each of these item types, and also different exporter instances, so that each type ends up in its own file (if you are exporting to files):

settings.py

ITEM_PIPELINES = {
    'pipelines.CommentsPipeline': 1,
    'pipelines.ProfilePipeline': 1,
}

pipelines.py

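class CommentsPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CommentItem):
            pass  # Your business here
        return item  # always return the item so the next pipeline receives it

class ProfilePipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProfileItem):
            pass  # Your business here
        return item  # always return the item so the next pipeline receives it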
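A point worth illustrating with two separate pipelines: Scrapy feeds the return value of one pipeline's process_item into the next pipeline, so each pipeline should return the item even when it ignores that type. The sketch below imitates this with a hypothetical `run_chain` helper and hypothetical dataclass items. It is not how Scrapy itself is implemented, only a stand-in for experimenting outside a crawl.

```python
from dataclasses import dataclass


@dataclass
class ProfileItem:  # hypothetical item class
    name: str


@dataclass
class CommentItem:  # hypothetical item class
    text: str


class CommentsPipeline:
    def __init__(self):
        self.comments = []

    def process_item(self, item, spider):
        if isinstance(item, CommentItem):
            self.comments.append(item.text)
        return item  # returned even when ignored


class ProfilePipeline:
    def __init__(self):
        self.profiles = []

    def process_item(self, item, spider):
        if isinstance(item, ProfileItem):
            self.profiles.append(item.name)
        return item  # returned even when ignored


def run_chain(pipelines, item, spider=None):
    # Hypothetical helper: pass each pipeline's return value to the next one.
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


comments_pipeline = CommentsPipeline()
profile_pipeline = ProfilePipeline()
chain = [comments_pipeline, profile_pipeline]
for scraped in (ProfileItem('alice'), CommentItem('hello')):
    run_chain(chain, scraped)
```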

score 4 · Answer 3 · answered Nov 12 '19 at 02:58

The answer by @Rejected was the solution, but it needed some tweaks before it would work for me, so I am sharing them here. This is my pipeline.py:

from .items import MyFirstItem, MySecondItem  # needed import of Items

class MyPipeline:  # hypothetical class name; the original snippet omitted the class line
    def process_item(self, item, spider):
        if isinstance(item, MyFirstItem):
            return self.handlefirstitem(item, spider)
        if isinstance(item, MySecondItem):
            return self.handleseconditem(item, spider)
        return item  # any other item type passes through unchanged

    def handlefirstitem(self, item, spider):  # needed self added
        self.storemyfirst_db(item)  # function to pipe it to database table
        return item

    def handleseconditem(self, item, spider):  # needed self added
        self.storemysecond_db(item)  # function to pipe it to database table
        return item

score 1 · Answer 4 · answered Sep 23 '15 at 16:09

The straightforward way is to have the parser include two sub-parsers, one for each data type. The main parser determines the type from the input and passes the string to the appropriate subroutine.

A second approach is to include the parsers in sequence: one parses Profiles and ignores all else; the second parses Comments and ignores all else (same principle as above).

Does this move you forward?

Prune


Can you please share the sample parser here for the reference please – Suresh K Aug 21 '19 at 09:49
You should direct that request to the person writing the parsers. I only outlined the top level of system organization. – Prune Aug 21 '19 at 15:11

score 1 · Answer 5 · answered Apr 15 '21 at 07:35

I've come up with this solution.

I defined an ITEMS dict in the settings.py file:

ITEMS = {
    'project.items.Item1': {
        'filename': 'item1',
    },
    'project.items.Item2': {
        'filename': 'item2',
    },
}

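Then I imported the settings in the pipeline.py file. The snippets below appear to assume that self.settings holds the ITEMS dict, for example via get_project_settings().get('ITEMS'):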

from scrapy.utils.project import get_project_settings

In the open_spider method, create a file for each item type listed in the settings and attach an exporter to it:

for settings_key in self.settings.keys():
    filename = os.path.join(f"output/{self.settings[settings_key]['filename']}_{self.dt}.csv")
    self.settings[settings_key]['file'] = open(filename, 'wb')
    self.settings[settings_key]['exporter'] = CsvItemExporter(
        self.settings[settings_key]['file'], 
        encoding='utf-8', 
        delimiter=';', 
        quoting=csv.QUOTE_NONNUMERIC
    )
    self.settings[settings_key]['exporter'].start_exporting()

In the close_spider method, stop all exporters and close all files:

for settings_key in self.settings.keys():
    self.settings[settings_key]['exporter'].finish_exporting()
    self.settings[settings_key]['file'].close()

In the process_item method, pick the exporter that matches the item's class and export the item with it:

item_class = f"{type(item).__module__}.{type(item).__name__}"
settings_item = self.settings.get(item_class)
if settings_item:
    settings_item['exporter'].export_item(item)
return item
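To show how the three methods fit together, here is a standalone sketch of the same routing idea. It deliberately swaps in different pieces so it can run without Scrapy: the standard library's `csv.DictWriter` replaces `CsvItemExporter`, in-memory `io.StringIO` buffers replace real files, and the items are hypothetical dataclasses. The `routes` dict, the `dotted` helper and the `'output'` key are names made up for this example.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Item1:  # hypothetical item class
    title: str


@dataclass
class Item2:  # hypothetical item class
    author: str
    text: str


def dotted(cls):
    # Same key format as in the answer: "<module>.<class name>".
    return f"{cls.__module__}.{cls.__name__}"


class PerTypeCsvPipeline:
    def __init__(self, item_classes):
        # One routing entry per item class, keyed by dotted class path.
        self.routes = {dotted(cls): {'cls': cls} for cls in item_classes}

    def open_spider(self, spider):
        for route in self.routes.values():
            route['file'] = io.StringIO()  # a real project would open a file here
            names = [f.name for f in fields(route['cls'])]
            route['writer'] = csv.DictWriter(route['file'], fieldnames=names, delimiter=';')
            route['writer'].writeheader()

    def process_item(self, item, spider):
        route = self.routes.get(dotted(type(item)))
        if route:
            route['writer'].writerow(asdict(item))
        return item

    def close_spider(self, spider):
        for route in self.routes.values():
            route['output'] = route['file'].getvalue()  # keep the text before closing
            route['file'].close()


csv_pipeline = PerTypeCsvPipeline([Item1, Item2])
csv_pipeline.open_spider(None)
csv_pipeline.process_item(Item1('a'), None)
csv_pipeline.process_item(Item2('bob', 'hi'), None)
csv_pipeline.close_spider(None)
out1 = csv_pipeline.routes[dotted(Item1)]['output']
out2 = csv_pipeline.routes[dotted(Item2)]['output']
```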

score 0 · Answer 6 · answered Jul 26 '20 at 08:17

From Python 3.10 onward, structural pattern matching is available: https://www.python.org/dev/peps/pep-0622/

It would probably be convenient to implement the router (see @mdkb's answer) with structural pattern matching.

Note that Item classes are also a legacy way of defining items, since data classes are available from Python 3.7.

quester


score 0 · Answer 7 · answered Jul 26 '20 at 09:05

I would suggest adding a comments field to ProfileItem. That way you can attach multiple comments to a single person's profile, and the data will also be easier to process.
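A minimal sketch of that nesting, assuming a hypothetical dataclass item (recent Scrapy versions accept dataclass objects as items) and a hard-coded list standing in for the comment strings a spider would extract from the response:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ProfileItem:  # hypothetical item class
    name: str
    # Each profile carries its own list of comments.
    comments: list = field(default_factory=list)


profile = ProfileItem(name='alice')
# Stand-in for the strings extracted from the page (assumption for this sketch).
for text in ['first', 'second']:
    profile.comments.append(text)

print(asdict(profile))  # {'name': 'alice', 'comments': ['first', 'second']}
```

With this layout a single pipeline receives the profile and all of its comments together, so no cross-item bookkeeping is needed at export time.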

Ikram Khan Niazi

