
I am trying to scrape data from this webpage, and I am able to extract what I need.
The problem is that the page downloaded with requests contains only 45 product details, while the page actually has more than 4000 products; this happens because the rest of the data only loads as you scroll down the page.
I would like to scrape all the products available on the page.

CODE

import requests
from bs4 import BeautifulSoup
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

base_url = "link that i provided"
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

# The product data is embedded in the page's 12th <script> tag as a
# JavaScript assignment of the form "<variable> = {...};"
scripts = soup.find_all('script')[11].text
script = scripts.split('=', 1)[1]  # keep everything after the first '='
script = script.rstrip()
script = script[:-1]               # drop the trailing semicolon

data = json.loads(script)

skus = list(data['grid']['entities'].keys())

prodpage = []
for sku in skus:
    prodpage.append('https://www.ajio.com{}'.format(data['grid']['entities'][sku]['url']))

print(len(prodpage))   
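
As a side note, `soup.find_all('script')[11]` relies on the script's position in the page never changing. A more robust variant is to pick the script by its content instead of its index. This is a minimal sketch that reuses `soup` and `json` from the code above; keying off the `'"grid"'` substring is an assumption based on the JSON structure this code already parses:

# Locate the state-bearing <script> by content rather than by position.
# Assumes the embedded JSON contains the '"grid"' key used above.
state_script = next(
    s.text for s in soup.find_all('script')
    if s.text and '"grid"' in s.text and '=' in s.text
)
script = state_script.split('=', 1)[1].rstrip().rstrip(';')
data = json.loads(script)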
– james joyce
  • Does this answer your question? [How to load all entries in an infinite scroll at once to parse the HTML in python](https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho) – dspencer Apr 08 '20 at 09:59
  • The question is the same, but I don't think the given answer fits my case: my data is hidden inside JavaScript, which I convert to JSON, and I have already extracted it. If possible, I just want all the data to be present in the HTML and JavaScript that I pass to `requests`; from there I can work on it. – james joyce Apr 08 '20 at 10:09

1 Answer


Scrolling down means the data is generated by JavaScript, so you have more than one option here. The first is to use Selenium (a minimal sketch of that approach follows the code below); the second is to send the same Ajax request the website itself uses while you scroll, as follows:

import requests

def get_source(page_num=1):
    # The same JSON endpoint the page calls while you scroll; currentPage
    # selects which batch of 45 products to return.
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'
    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()

# data = get_source(page_num=1)
# total_pages = data['pagination']['totalPages']  # total pages are 111

prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3:  # demo: stop after the first 3 pages
        break
print(len(prodpage))  # output: 135 for 3 pages
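
For completeness, the Selenium option mentioned above would look roughly like this: keep scrolling to the bottom until the page height stops growing, then hand the fully rendered HTML to BeautifulSoup. This is a minimal sketch, not tested against this site; the 2-second pause is an assumption you may need to tune for your connection:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get(base_url)  # base_url from the question's code

# Keep scrolling until the document height stops growing, i.e. no more
# products are being lazy-loaded.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the next batch of products time to load
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # now contains all loaded products
driver.quit()

That said, the Ajax route above is usually preferable here: it returns clean JSON, needs no browser, and pages through the full catalogue directly.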
– Ahmed Soliman