Why is scraped text a string in a Scrapy spider, but a list in the pipeline?

Question

Can somebody explain this to me, please?
In my spider, I have code for extracting data using XPath.

price_euro = add.xpath('.//strong[@class="price price--eur"]/text()').extract_first()
print 'price_euro', price_euro, type(price_euro)

and what I get is:

price_euro 25.500  <type 'unicode'>

I understand this: it is a string (Unicode) because I used .extract_first(), and this is what I want.

But in my pipeline,

print "item['price_euro']", item['price_euro'], type(item['price_euro'])

I have it as a list

item['price_euro'] [u'25.500 '] <type 'list'>

This is not a big problem for me, but it is annoying, because every time I want to access the value I need to add [0] at the end, e.g. item['price_euro'][0].

Can I disable this and should I?
What is the logic behind this?

Thank you

This is how I add price_euro:

l = ItemLoader(item=MyItem(), response=response)
l.add_value('price_euro', price_euro)      
yield l.load_item()

Where do you assign the price to the item? What does the entire item look like when printed? — omu_negru, Nov 04 '17 at 09:28

score 3 · Accepted Answer · answered Nov 07 '17 at 11:43

The ItemLoader allows calling add_value() (as well as add_css() and add_xpath()) multiple times for the same field. This is helpful when the information you are looking for can be found in multiple places of the HTML source, or when the HTML layout differs between requests. To accommodate this, the item loader stores all field values inside lists.
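The collecting behaviour described above can be illustrated with a small toy model. This is only a sketch: ToyLoader is a hypothetical stand-in written for this example, not Scrapy's actual ItemLoader implementation, and it ignores input and output processors entirely.

```python
# Toy model of the "collect everything into lists" behaviour.
# ToyLoader is hypothetical and is NOT Scrapy's implementation.
from collections import defaultdict


class ToyLoader(object):
    """Simplified stand-in for an item loader, for illustration only."""

    def __init__(self):
        # Every field name maps to a list of collected values.
        self._values = defaultdict(list)

    def add_value(self, field, value):
        # A single value is appended; a list or tuple extends the field's list.
        if isinstance(value, (list, tuple)):
            self._values[field].extend(value)
        else:
            self._values[field].append(value)

    def load_item(self):
        # With no output processor, the collected lists are returned unchanged.
        return dict(self._values)


loader = ToyLoader()
loader.add_value('price_euro', u'25.500 ')
item = loader.load_item()
print(type(item['price_euro']))

# Calling add_value() again for the same field simply grows the list.
loader.add_value('price_euro', u'26.000 ')
both = loader.load_item()
print(len(both['price_euro']))
```

Even a single add_value() call therefore ends up as a one-element list, which matches what the pipeline in the question sees.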

When you expect exactly one value for the field (as for your price information), you can tell the item loader how to convert the list when load_item() is called by specifying an output processor. The canonical way to do this is by subclassing the ItemLoader class:

from scrapy.loader import ItemLoader
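# In recent Scrapy releases the processors live in the separate itemloaders
# package instead: from itemloaders.processors import TakeFirst
from scrapy.loader.processors import TakeFirst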

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    price_euro_out = TakeFirst()

You can then populate this item loader as before, with the additional upside that you no longer have to tell the item loader which item type to use:

l = MyItemLoader(response=response)
l.add_value('price_euro', price_euro)      
yield l.load_item()

For the example code you posted, you can even avoid the manual extraction by using the add_xpath() method and passing add as the selector keyword argument to the item loader:

l = MyItemLoader(selector=add)
l.add_xpath('price_euro', './/strong[@class="price price--eur"]/text()')      
yield l.load_item()

If you want to enable this "take the first list element" behaviour for all fields of your item, you can also declare a default output processor for your item loader:

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_output_processor = TakeFirst()

The Scrapy docs have a list of built-in processors.
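To make the effect of a take-first output processor concrete, here is a rough plain-Python sketch of the idea: return the first collected value that is neither None nor an empty string. The function name take_first is hypothetical and the code is an approximation for illustration, not the library's source.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string.

    Hypothetical helper that approximates a take-first output processor.
    """
    for value in values:
        if value is not None and value != '':
            return value
    # Nothing usable was collected.
    return None


# The list the pipeline in the question received:
collected = [u'25.500 ']
price = take_first(collected)
print(type(price))
```

Note that the value still carries its trailing space; taking the first element does not clean the string, so price.strip() (or a suitable input processor) would be needed for that.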

Thank you for your answer, now things make sense. To solve this problem I made a pipeline that takes only the first element. I have been using Scrapy for the last week; it is a very good framework once you understand the logic behind it. I will try to use this approach. — WebOrCode, Nov 07 '17 at 12:01
