
I'm trying to fetch each product's name and price from https://www.daraz.pk/catalog/?q=risk, but nothing shows up.

import requests
from bs4 import BeautifulSoup

page_html = requests.get("https://www.daraz.pk/catalog/?q=risk").text
page_soup = BeautifulSoup(page_html, "html.parser")

containers = page_soup.find_all("div", {"class": "c2p6A5"})

for container in containers:
    pname = container.find_all("div", {"class": "c29Vt5"})
    name = pname[0].text
    price1 = container.find_all("span", {"class": "c29VZV"})
    price = price1[0].text
    print(name)
    print(price)
  • You can grab the JSON, but you have to know the number of pages in order to get all results. The only way to get the number of pages is to first let the page render, e.g. use Selenium, then switch to requests. You also don't need to use regex, as you can simply do item = soup.select('script')[2] – QHarr Dec 15 '18 at 13:48
  • Yes, thank you. Someone else also advised this; I'm still figuring out: how do you check that it's returning JSON data? – Subial Ijaz Dec 15 '18 at 13:51
  • F12 to open dev tools and inspect the HTML. I searched for the first price in the HTML using Ctrl + F (5,900). This showed me an occurrence of that value in a JSON string inside a script tag. You can see from the script syntax that this is used to update the page. You can get each page with the syntax https://www.daraz.pk/catalog/?page=1&q=risk and change the page number. You cannot, however, get the total number of pages without using a browser (AFAIK). – QHarr Dec 15 '18 at 14:00
  • So, depending on whether timing is really an issue, I would use a solution which renders the page to get the page count, then switch to requests. You can get the number of pages from the length of the list matched by the selector li[class*="ant-pagination-item ant-pagination-item-"] – QHarr Dec 15 '18 at 14:00
  • Thank you very much – Subial Ijaz Dec 15 '18 at 14:34
  • I was wrong. You can calculate the page count from the json. Shown below (updated). – QHarr Dec 15 '18 at 16:50
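
The per-page URL pattern from the comments above can be generated in a loop. A minimal sketch (the page count of 4 here is a made-up placeholder; the answers below show how to derive the real count from the JSON):

```python
# Sketch of generating the per-page catalog URLs described above.
# num_pages is a hypothetical placeholder value.
template = 'https://www.daraz.pk/catalog/?page={}&q=risk'
num_pages = 4  # placeholder; the real value comes from the page's JSON

urls = [template.format(page) for page in range(1, num_pages + 1)]
print(urls[0])   # prints https://www.daraz.pk/catalog/?page=1&q=risk
print(len(urls)) # prints 4
```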

4 Answers


If the page is dynamic, Selenium should take care of that:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.daraz.pk/catalog/?q=risk')

r = browser.page_source
page_soup = BeautifulSoup(r, 'html.parser')  # bs4.BeautifulSoup would fail: bs4 was never imported as a module

containers = page_soup.find_all("div", {"class": "c2p6A5"})

for container in containers:
    pname = container.find_all("div", {"class": "c29Vt5"})
    name = pname[0].text
    price1 = container.find_all("span", {"class": "c29VZV"})
    price = price1[0].text
    print(name)
    print(price)

browser.close() 

output:

Risk Strategy Game
Rs. 5,900
Risk Classic Board Game
Rs. 945
RISK - The Game of Global Domination
Rs. 1,295
Risk Board Game
Rs. 1,950
Risk Board Game - Yellow
Rs. 3,184
Risk Board Game - Yellow
Rs. 1,814
Risk Board Game - Yellow
Rs. 2,086
Risk Board Game - The Game of Global Domination
Rs. 975
...
chitown88

There is JSON data in the page. You can get it from the <script> tag using BeautifulSoup, but I don't think that is needed, because you can get it directly with json and re:

import requests, json, re

html = requests.get('https://.......').text

jsonStr = re.search(r'window.pageData=(.*?)</script>', html).group(1)
jsonObject = json.loads(jsonStr)

for item in jsonObject['mods']['listItems']:
    print(item['name'])
    print(item['price'])
ewwink
  • @ewwink I've noticed you're very knowledgeable on this subject. I'm not quite sure when to use Selenium and when you can still use requests (I usually just default to Selenium when requests doesn't yield results). Is there a link/resource you know of to help me understand that better? Sort of a "checklist" to know exactly at what point only Selenium will be the correct choice? – chitown88 Dec 15 '18 at 17:38
  • I'm not an expert :D It's simple; the rule is: I will use Selenium if I can't find it in the `page source` or cannot replicate the `XHR` or `Ajax` request. – ewwink Dec 15 '18 at 17:55
  • Thanks. I'm not really familiar with XHR or Ajax requests, but just by you saying that, it gives me some direction. – chitown88 Dec 15 '18 at 17:59

I was wrong. The info needed to calculate the page count is present in the JSON, so you can get all results. No regex is needed, as you can simply extract the relevant script tag. Also, you can build the page URLs in a loop.

import requests
from bs4 import BeautifulSoup
import json
import math

def getNameAndPrice(url):
    global resultCount, resultsPerPage, numPages  # without this, the assignments below create locals and numPages stays 0
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    # str.strip removes a set of characters, not a prefix, so split off
    # the 'window.pageData=' prefix instead
    script_text = soup.select('script')[2].text
    data = json.loads(script_text.split('window.pageData=', 1)[1])
    if url == startingPage:
        resultCount = int(data['mainInfo']['totalResults'])
        resultsPerPage = int(data['mainInfo']['pageSize'])
        numPages = math.ceil(resultCount / resultsPerPage)
    result = [[item['name'], item['price']] for item in data['mods']['listItems']]
    return result

resultCount = 0
resultsPerPage = 0
numPages = 0
link = "https://www.daraz.pk/catalog/?page={}&q=risk"
startingPage = "https://www.daraz.pk/catalog/?page=1&q=risk"
results = []
results.append(getNameAndPrice(startingPage))

for links in [link.format(page) for page in range(2,numPages + 1)]: 
    results.append(getNameAndPrice(links))
QHarr

Referring to the JSON answer, for someone who is very new like me: you can use Selenium to navigate to the search result page like this:

PS: Thanks to @ewwink very much. You saved my day!

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time  # time delay while the page loads
import json, re

keyword = 'fan'

opt = webdriver.ChromeOptions()
opt.add_argument('--headless')
driver = webdriver.Chrome(options=opt)

# driver = webdriver.Chrome()

url = 'https://www.lazada.co.th/'
driver.get(url)

search = driver.find_element_by_name('q')
search.send_keys(keyword)
search.send_keys(Keys.RETURN)

time.sleep(3)  # wait 3 seconds for the page to load

page_html = driver.page_source  # Selenium equivalent of page_html = webopen.read() for BS

driver.close()

jsonStr = re.search(r'window.pageData=(.*?)</script>', page_html).group(1)
jsonObject = json.loads(jsonStr)

for item in jsonObject['mods']['listItems']:
    print(item['name'])
    print(item['sellerName'])
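
Note that `item['sellerName']` raises `KeyError` if any listing lacks that key; `dict.get` with a default is safer. A sketch with made-up entries (not real Lazada data):

```python
# Hypothetical listItems entries; not every item necessarily carries every key
items = [
    {'name': 'Ceiling Fan', 'sellerName': 'CoolAir'},
    {'name': 'Desk Fan'},  # sellerName missing
]

for item in items:
    print(item['name'])
    print(item.get('sellerName', 'unknown seller'))  # avoids KeyError
```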