
I need to scrape data from this page: "https://www.forever21.com/us/shop/catalog/category/f21/sale#pageno=1&pageSize=120&filter=price:0,250&sort=5", but I cannot retrieve all of the items — the page is paginated and loads its content with JavaScript.

Any idea how I can scrape all the items? Here's my code:

def parse_2(self, response):
    for product_item_forever in response.css('div.pi_container'):
        item = GpdealsSpiderItem_f21()
        item['f21_title'] = product_item_forever.css('p.p_name::text').extract_first()
        item['f21_regular_price'] = product_item_forever.css('span.p_old_price::text').extract_first()
        item['f21_sale_price'] = product_item_forever.css('span.p_sale.t_pink::text').extract_first()
        item['f21_photo_url'] = product_item_forever.css('img::attr(data-original)').extract_first()
        item['f21_description_url'] = product_item_forever.css('a.item_slider.product_link::attr(href)').extract_first()
        yield item

Please help. Thank you.

1 Answer


One of the first steps in a web scraping project should be to look for an API that the website itself uses to fetch its data. Not only does this save you parsing HTML, using an API also saves the provider's bandwidth and server load. To find such an API, open your browser's developer tools and look for XHR requests in the network tab. In your case, the website makes POST requests to this URL:

https://www.forever21.com/eu/shop/Catalog/GetProducts
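Before writing a full spider, you can verify the endpoint in isolation. Here is a minimal sketch using only the standard library; the payload fields mirror what the browser sends in its XHR request (`build_payload` and `make_request` are helper names introduced for illustration):

```python
import json
from urllib import request

URL = "https://www.forever21.com/eu/shop/Catalog/GetProducts"

def build_payload(page_no, page_size=60):
    # mirrors the JSON body the browser sends in its XHR POST
    return {
        "brand": "f21",
        "category": "sale",
        "page": {"pageNo": page_no, "pageSize": page_size},
        "filter": {"price": {"minPrice": 0, "maxPrice": 250}},
        "sort": {"sortType": "5"},
    }

def make_request(page_no):
    # build a POST request with the same headers the browser uses
    body = json.dumps(build_payload(page_no)).encode("utf-8")
    return request.Request(
        URL,
        data=body,
        headers={
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/json; charset=UTF-8",
        },
        method="POST",
    )

# To actually send the request (requires network access):
#   with request.urlopen(make_request(1)) as resp:
#       data = json.load(resp)
```

If the response contains a `CatalogProducts` list, you have confirmed the endpoint and can move the same payload and headers into a spider.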

You can then simulate the XHR request in Scrapy to get the data in JSON format. Here's the code for the spider:

# -*- coding: utf-8 -*-
import json
import scrapy

class Forever21Spider(scrapy.Spider):
    name = 'forever21'

    url = 'https://www.forever21.com/eu/shop/Catalog/GetProducts'
    payload = {
        'brand': 'f21',
        'category': 'sale',
        'page': {'pageSize': 60},
        'filter': {
            'price': {'minPrice': 0, 'maxPrice': 250}
        },
        'sort': {'sortType': '5'}
    }

    def start_requests(self):
        # request the first page; build a fresh payload so the shared
        # self.payload template is never mutated
        payload = dict(self.payload, page=dict(self.payload['page'], pageNo=1))
        yield scrapy.Request(
            self.url, method='POST', body=json.dumps(payload),
            headers={'X-Requested-With': 'XMLHttpRequest',
                     'Content-Type': 'application/json; charset=UTF-8'},
            callback=self.parse, meta={'pageNo': 1}
        )

    def parse(self, response):
        # parse the JSON response and extract the data
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = {
                'title': product['DisplayName'],
                'regular_price': product['OriginalPrice'],
                'sale_price': product['ListPrice'],
                'photo_url': 'https://www.forever21.com/images/default_330/%s' % product['ImageFilename'],
                'description_url': product['ProductShareLinkUrl']
            }
            yield item

        # request the next page if the current one was full; a page shorter
        # than pageSize means we have reached the end of the catalogue
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            next_page = response.meta['pageNo'] + 1
            payload = dict(self.payload, page=dict(self.payload['page'], pageNo=next_page))
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse, meta={'pageNo': next_page}
            )
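The pagination stop condition works because the API returns full pages of `pageSize` items until the catalogue is exhausted, so a shorter page marks the last one. A standalone sketch of that logic against simulated pages (`has_more_pages` is a helper name introduced for illustration):

```python
def has_more_pages(products, page_size):
    # a full page suggests more results remain; a short page is the last one
    return len(products) == page_size

# simulated API pages: two full pages of 60 items, then a short final page
pages = [["item"] * 60, ["item"] * 60, ["item"] * 17]
fetched = 0
for page in pages:
    fetched += len(page)
    if not has_more_pages(page, 60):
        break
print(fetched)  # 137
```

One caveat: if the total item count happens to be an exact multiple of `pageSize`, the spider issues one extra request that returns an empty page before stopping, which is harmless.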
Tomáš Linhart
  • Thanks @Tomáš Linhart, everything works fine, I am trying to add it on my current spider now wonder if you could take a look on this one as well, really appreciate your help thank you https://stackoverflow.com/questions/55761521/one-spider-with-2-different-url-and-2-parse-using-scrapy – Christian Read Apr 19 '19 at 12:03