I extract the URL of the first order-results page of each exhibitor on a specific EC site, save those URLs to a CSV file, read the file in start_requests, and loop over the rows with a for statement.

Each order result page contains information on 30 products.

https://www.buyma.com/buyer/2597809/sales_1.html

I select the links for the 30 items on each order results page as lists, then tried to retrieve them one by one and store each in its own item as shown in the code below, but it does not work.

import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

# ResearchtoolItem is the project's own item class (its import is not shown here).


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 300):
                    url = str((row[2])[:-5] + '/sales_' + str(n) + '.html')
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True
                    )

    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)

        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()

        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])

            yield loader.load_item()

The output is as follows, where each item contains multiple items of information at once.

Current status: {"product_name": ["product1", "product2"], "Conversion_date": ["Conversion_date1", "Conversion_date2"], "product_URL": ["product_URL1", "product_URL2"]}

Ideal: [{"product_name": "product1", "Conversion_date": "Conversion_date1", "product_URL": "product_URL1"}, {"product_name": "product2", "Conversion_date": "Conversion_date2", "product_URL": "product_URL2"}]

This may be due to my lack of understanding of basic for statements and yield.

K_MM

1 Answer

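You need to create a new loader in each iteration. Your code creates one loader before the loop, so every add_value call keeps adding to the same item and each yielded item contains all the values collected so far.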

for i in range(30):
    loader = ItemLoader(item = ResearchtoolItem(), response = response)
    loader.add_value("Conversion_date", Conversion_date[i])
    loader.add_value("product_name", product_name[i])
    loader.add_value("product_URL", product_URL[i])
    
    yield loader.load_item()
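
The loop above assumes that every page yields exactly 30 entries in each list; a page with fewer entries would raise an IndexError. One possible alternative, sketched here with placeholder data instead of a real response, is to pair the lists with zip, which stops at the shortest list. The helper name `build_rows` and the sample values are made up for this illustration; in the spider, the dict would be replaced by a fresh loader per iteration.

```python
# Placeholder lists standing in for the results of the three .getall() calls.
conversion_dates = ["2022/09/01", "2022/09/02"]
product_names = ["product1", "product2"]
product_urls = ["/item/1/", "/item/2/"]


def build_rows(dates, names, urls):
    """Yield one dict per product, pairing the lists element by element."""
    # zip stops at the shortest list, so a short page cannot cause an IndexError.
    for date, name, url in zip(dates, names, urls):
        yield {"Conversion_date": date, "product_name": name, "product_URL": url}


rows = list(build_rows(conversion_dates, product_names, product_urls))
print(rows)
```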

EDIT:

add_value appends the value to a list kept for that field. The list starts empty, so after a single call it holds exactly one element, which is why each field is output as a one-element list.
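
The accumulation can be illustrated without Scrapy. The sketch below uses a hypothetical `MiniLoader` class, a simplified stand-in written for this illustration (it is not Scrapy's ItemLoader), that collects values per field in a list as described above. It compares one shared loader with a fresh loader per iteration.

```python
from collections import defaultdict


class MiniLoader:
    """Hypothetical stand-in for an item loader: each field collects a list."""

    def __init__(self):
        self._values = defaultdict(list)

    def add_value(self, field, value):
        # Append to the field's list instead of overwriting it.
        self._values[field].append(value)

    def load_item(self):
        # Return a plain-dict copy of everything collected so far.
        return {field: list(values) for field, values in self._values.items()}


names = ["product1", "product2"]

# One shared loader: values pile up across iterations.
shared = MiniLoader()
shared_items = []
for name in names:
    shared.add_value("product_name", name)
    shared_items.append(shared.load_item())

# A fresh loader per iteration: each item holds exactly one value.
fresh_items = []
for name in names:
    loader = MiniLoader()
    loader.add_value("product_name", name)
    fresh_items.append(loader.load_item())

print(shared_items[-1])  # {'product_name': ['product1', 'product2']}
print(fresh_items)       # [{'product_name': ['product1']}, {'product_name': ['product2']}]
```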

To get each value as a string instead of a one-element list, you can use an output processor such as TakeFirst. Example:

import scrapy
from scrapy.loader import ItemLoader
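# Since Scrapy 2.2 the processors live in the itemloaders package;
# scrapy.loader.processors is the older, deprecated location.
from itemloaders.processors import TakeFirst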


class ProductItem(scrapy.Item):
    name = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']

    def parse(self, response, **kwargs):
        names = response.xpath('//div[@class="card-body"]//h4/a/text()').getall()
        prices = response.xpath('//div[@class="card-body"]//h5//text()').getall()
        length = len(names)

        for i in range(length):
            loader = ItemLoader(item=ProductItem(), response=response)
            loader.add_value('name', names[i])
            loader.add_value('price', prices[i])

            yield loader.load_item()
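
To show what the output processor does to the collected lists, here is a minimal sketch in plain Python. The function `take_first` is a hand-written stand-in that mirrors the documented behaviour of TakeFirst (return the first value that is neither None nor an empty string); it is not the library class itself, and the sample data is made up.

```python
def take_first(values):
    """Hypothetical stand-in: return the first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Lists as a loader would have collected them, one list per field.
collected = {"name": ["product1"], "price": ["", "$24.99"]}

# Applying the processor per field turns each list into a single value.
item = {field: take_first(values) for field, values in collected.items()}
print(item)  # {'name': 'product1', 'price': '$24.99'}
```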
SuperUser
  • Thank you. I wrote and ran it as you described in your answer, and it outputs in the form I was looking for! However, each value is output as a list with a single element, where I expected a string. I do not know the cause. – K_MM Sep 17 '22 at 06:56
  • @K_MM I edited my answer to solve this issue as well. If it solves your problem then please accept the answer. – SuperUser Sep 17 '22 at 07:20
  • I completely overlooked the fact that the output of the add_value method is a list type. Thanks so much. – K_MM Sep 19 '22 at 02:38