
I'm trying to scrape https://www.fairprice.com.sg/baby-child for product names and prices.

There's a "load more" button at the bottom of the page. I've tried using Postman to modify the form data, and 'productBeginIndex' and 'resultsPerPage' seem to control the number of products shown.

However, I'm unsure what's wrong with my code: it still returns the same 24 products no matter how I tweak the values. I've also tried FormRequest.from_response(), but it still returns just 24 products.

import scrapy


class PriceSpider(scrapy.Spider):
    name = "products"

    def parse(self, response):
        return [scrapy.FormRequest(url="https://www.fairprice.com.sg/baby-child",
                                   method='POST',
                                   formdata={'productBeginIndex': '1', 'resultsPerPage': '1'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        name = response.css("img::attr(title)").extract()
        price = response.css(".pdt_C_price::text").extract()

        for item in zip(name, price):
            scraped_info = {
                "title": item[0],
                "value": item[1],
            }
            yield scraped_info

Could someone please tell me what I'm missing? And how could I implement a loop to extract all the objects in the category?

Thank you so much!


1 Answer


You should POST to /ProductListingView instead of /baby-child (a GET request will work, too).
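
If you want to sanity-check that endpoint outside Scrapy first, a plain POST with the requests library works (a sketch; the form fields are copied from the spider below, and the subset shown here is an assumption about which ones matter):

import requests

# Hedged sketch: field values copied from the Scrapy spider further down;
# the site may require more of the fields than are shown here.
resp = requests.post(
    "https://www.fairprice.com.sg/ProductListingView",
    data={
        "beginIndex": "0",
        "resultsPerPage": "2",
        "categoryId": "3074457345616686371",
        "catalogId": "10201",
        "storeId": "10151",
        "langId": "-1",
    },
)
print(resp.status_code, len(resp.text))  # expect 200 and a non-empty HTML fragment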

To scrape all objects, modify the beginIndex parameter in a loop and yield a new request each time. (By the way, modifying productBeginIndex will not work.)

We don't know the total number of products, so a safe approach is to crawl a group of products per request. By modifying custom_settings, you can easily control where to begin and how many products to scrape.

As for how to output to a CSV file, refer to Scrapy pipeline to export csv file in the right format.

For convenience, I've added the PriceItem class below; you may add it to items.py. Using the command scrapy crawl PriceSpider -t csv -o test.csv, you will get a test.csv file. Or, you can try CsvItemExporter.
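
If you'd rather configure the CSV export in code than on the command line, the feed-export settings can go into the same custom_settings dict (a sketch; FEED_FORMAT and FEED_URI are the classic setting names, replaced by a single FEEDS dict in Scrapy 2.1+):

custom_settings = {
  "BEGIN_PAGE": 0,
  "END_PAGE": 2,
  "RESULTS_PER_PAGE": 2,
  # feed export: equivalent to passing -t csv -o test.csv on the command line
  "FEED_FORMAT": "csv",
  "FEED_URI": "test.csv",
}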

# OUTPUTS
# 2018-08-15 16:00:08 [PriceSpider] INFO: ['Nestle Nan Optipro Gro Growing Up Milk Formula -Stage 3', 'Friso Gold Growing Up Milk Formula - Stage 3']
# 2018-08-15 16:00:08 [PriceSpider] INFO: ['\n\t\t\t\t\t$199.50\n\t\t\t\t', '\n\t\t\t\t\t$79.00\n\t\t\t\t']
# 2018-08-15 16:00:08 [PriceSpider] INFO: ['Aptamil Gold+ Toddler Growing Up Milk Formula - Stage 3', 'Aptamil Gold+ Junior Growing Up Milk Formula - Stage 4']
# 2018-08-15 16:00:08 [PriceSpider] INFO: ['\n\t\t\t\t\t$207.00\n\t\t\t\t', '\n\t\t\t\t\t$180.00\n\t\t\t\t']
#
# the \n and \t are not a big deal, just strip() them

import scrapy

class PriceItem(scrapy.Item):
  title = scrapy.Field()
  value = scrapy.Field()

class PriceSpider(scrapy.Spider):
  name = "PriceSpider"

  custom_settings = {
    "BEGIN_PAGE" : 0,
    "END_PAGE" : 2,
    "RESULTS_PER_PAGE" : 2,
  }

  def start_requests(self):
    formdata = {
      "sType" : "SimpleSearch",
      "ddkey" : "ProductListingView_6_-2011_3074457345618269512",
      "ajaxStoreImageDir" : "%2Fwcsstore%2FFairpriceStorefrontAssetStore%2F",
      "categoryId" : "3074457345616686371",
      "emsName" : "Widget_CatalogEntryList_701_3074457345618269512",
      "beginIndex" : "0",
      "resultsPerPage" : str(self.custom_settings["RESULTS_PER_PAGE"]),
      "disableProductCompare" : "false",
      "catalogId" : "10201",
      "langId" : "-1",
      "enableSKUListView" : "false",
      "storeId" : "10151",
    }

    # loop to scrape different pages
    for i in range(self.custom_settings["BEGIN_PAGE"], self.custom_settings["END_PAGE"]):
      formdata["beginIndex"] = str(self.custom_settings["RESULTS_PER_PAGE"] * i)

      yield scrapy.FormRequest(
        url="https://www.fairprice.com.sg/ProductListingView",
        formdata=formdata,
        callback=self.logged_in
      )

  def logged_in(self, response):
      name = response.css("img::attr(title)").extract()
      price = response.css(".pdt_C_price::text").extract()

      self.logger.info(name)
      self.logger.info(price)

      # Output to CSV: refer to https://stackoverflow.com/questions/29943075/scrapy-pipeline-to-export-csv-file-in-the-right-format
      for item in zip(name, price):
        yield PriceItem(
          title = item[0].strip(),
          value = item[1].strip()
        )
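
And if you prefer the CsvItemExporter route mentioned above, a minimal pipeline sketch (the class name CsvExportPipeline is made up for illustration; register it under ITEM_PIPELINES in settings.py to activate it):

from scrapy.exporters import CsvItemExporter

class CsvExportPipeline:
  def open_spider(self, spider):
    self.file = open("test.csv", "wb")  # CsvItemExporter expects a binary file
    self.exporter = CsvItemExporter(self.file)
    self.exporter.start_exporting()

  def close_spider(self, spider):
    self.exporter.finish_exporting()
    self.file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

# settings.py (assumption: adjust the module path to your own project):
# ITEM_PIPELINES = {"yourproject.pipelines.CsvExportPipeline": 300}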
  • Thank you so much @Vic! I'm just wondering: 1. So instead of FormRequest I have to use start_requests first? 2. For the formdata, do I have to fill in all the fields like you did, or would just filling in beginIndex and resultsPerPage work? 3. How did you find out that it's beginIndex and not productBeginIndex? I looked into the network response (form data) and it was listed as productBeginIndex... 4. How do you output the data as a CSV file? I added a yield in the last line of your code and the csv file came up empty. Again, thank you so much, really appreciate your help! – chiff Aug 16 '18 at 07:42
  • @chiff 1. According to the [scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html), you should implement `start_requests` or define a `start_urls` attribute. I prefer the former because it's easy to control which pages to scrape. 2. Filling in only `beginIndex` and `resultsPerPage` may work (I tried just now; sometimes it worked, sometimes it didn't). 3. I noticed that `beginIndex` was the same as `productBeginIndex`, so I tried modifying `beginIndex`, and it worked. 4. I updated my answer. – Vic Aug 17 '18 at 11:08