I have a Python list containing these five web links:

https://en-ae.namshi.com/brands/buy-a-little-lovely-company-little-angel-light-w1030269a.html
https://en-ae.namshi.com/brands/buy-a-little-lovely-company-rose-bud-teething-ring-w1030273a.html
https://en-ae.namshi.com/brands/buy-a-little-lovely-company-cloud-projector-light-w869154a.html
https://en-ae.namshi.com/brands/buy-a-little-lovely-company-bunny-projector-light-w869153a.html
https://en-ae.namshi.com/brands/buy-a-little-lovely-company-little-fairy-light-w1030270a.html

I am trying to loop through the links and extract certain elements from each page. The extraction works fine for most of the elements, but I am unable to get the "RATING" and the "NUM_REVIEWS" values from the webpage to fill in those columns. Can someone please help me get these? Thanks.

Working code:

import pandas as pd 
import requests
from bs4 import BeautifulSoup
from lxml import html
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

list_urls = [
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-little-angel-light-w1030269a.html',
'https://en-ae.namshi.com/buy-american-eagle-straight-dark-wash-jeans-w925887a.html',
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-rose-bud-teething-ring-w1030273a.html',
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-cloud-projector-light-w869154a.html',
]

all_data = []
for lnk in list_urls:
    # Fetch each page once, with the browser-like headers, and parse it with both libraries
    page = requests.get(lnk, headers=headers)
    tree = html.fromstring(page.content)
    hun = BeautifulSoup(page.text, 'html.parser')

    product_name=hun.find("h1",{"class":"product__name"}).text.replace('\n',"")
    brand_name = hun.find("h2",{"class":"product__brandname"}).text.replace('\n',"")

    price = str(tree.xpath('//*[@id="content"]/div/div[3]/section[1]/div/div[1]/div/div[2]/header/div/p[1]/span[1]/text()')[0])
    reduced_price = str(tree.xpath('//*[@id="content"]/div/div[3]/section[1]/div/div[1]/div/div[2]/header/div/p[1]/span[2]/text()')[0])
    rating = str(tree.xpath('/html/body/div[1]/div[7]/div/div[3]/section[1]/div/div[2]/div/div[1]/text()'))
    num_reviews = str(tree.xpath('/html/body/div[1]/div[7]/div/div[3]/section[1]/div/div[2]/div/div[1]/div[1]/span[2]/text()'))
    
    sub_cat_1 = str(tree.xpath('//*[@id="content"]/div/div[3]/ul/li[3]/a/text()')[0])
    sub_cat_2 = str(tree.xpath('//*[@id="content"]/div/div[3]/ul/li[4]/a/text()')[0])
    sub_cat_3 = str(tree.xpath('//*[@id="content"]/div/div[3]/ul/li[5]/a/text()')[0])

    row = {"Product_Name": product_name, "Brand_Name" : brand_name, "Original_Price" : price, 
           "Discounted_Price" : reduced_price ,"Rating" : rating, "Num_Reviews" : num_reviews, 
           "Sub_cat_1" : sub_cat_1, "Sub_cat_2" : sub_cat_2, "Sub_cat_3" : sub_cat_3}

    all_data.append(row)

df = pd.DataFrame(all_data)

print(df.head(5))

Can you please help me get the ratings and reviews as well? Thanks in advance.

The Rating and Num_Reviews columns are blank.

Expected values in both columns:

[Image: expected values in rating and num_reviews]

abhishake
  • Your rating XPath doesn't work for any of those links. For those links, please indicate what value should be retrieved. – QHarr Jun 11 '21 at 15:17
  • @QHarr I have added an image to indicate the value on the page. Thanks, I really appreciate your help. – abhishake Jun 11 '21 at 15:59
  • @QHarr You might need to scroll down a bit to see the ratings section. – abhishake Jun 11 '21 at 16:01
  • @QHarr The link I used in the screenshot: https://en-ae.namshi.com/buy-american-eagle-straight-dark-wash-jeans-w925887a.html – abhishake Jun 11 '21 at 16:01

1 Answer

Those ratings are not part of the product page itself; they come from a dynamic request to another endpoint, keyed on the current product's SKU. The code in this answer shows how to get the data.

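This check:

if 'message' in review_info:

tests whether there are actually any ratings. When the response contains a 'message' key, the product has no ratings; otherwise the rating fields are present.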

I use a Session so that the underlying TCP connection is reused across the multiple requests, which is more efficient.

import pandas as pd 
import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

list_urls = [
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-little-angel-light-w1030269a.html',
'https://en-ae.namshi.com/buy-american-eagle-straight-dark-wash-jeans-w925887a.html',
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-rose-bud-teething-ring-w1030273a.html',
'https://en-ae.namshi.com/brands/buy-a-little-lovely-company-cloud-projector-light-w869154a.html',
]

all_data = []

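with requests.Session() as s:
    # Send the browser-like User-Agent with every request made through this session
    s.headers.update(headers)
    for lnk in list_urls:
        f = s.get(lnk).text
        # The SKU embedded in the page source identifies the product for the reviews endpoint
        sku = re.search(r'"sku":"(.*?)"', f).group(1)
        review_info = s.get(f'https://en-ae.namshi.com/_svc/reviews/{sku}').json()
        print(lnk)
        # A 'message' key means the endpoint returned no ratings for this SKU
        if 'message' in review_info:
            print('No averageRating')
            print('No totalAvailableRatings')
        else:
            print(review_info['averageRating'])
            print(review_info['totalAvailableRatings'])
        # Parse the page for the remaining fields, as in the question
        hun = BeautifulSoup(f,'html.parser')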
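
To fill the Rating and Num_Reviews columns rather than print the values, the two steps above (finding the SKU and checking for 'message') could be wrapped in small helpers, roughly as sketched here. This is only a sketch: the helper names extract_sku and parse_review_info are hypothetical, the sample HTML and JSON payloads are made up for illustration, and the real response from the reviews endpoint may contain other fields. It runs offline and makes no requests.

```python
import re

# Same pattern as in the answer; assumes the page source embeds "sku":"..."
SKU_PATTERN = re.compile(r'"sku":"(.*?)"')


def extract_sku(page_html):
    """Return the first SKU found in the page source, or None if there is none."""
    match = SKU_PATTERN.search(page_html)
    return match.group(1) if match else None


def parse_review_info(review_info):
    """Return (rating, num_reviews), or (None, None) when the product has no ratings."""
    if 'message' in review_info:
        return None, None
    return review_info.get('averageRating'), review_info.get('totalAvailableRatings')


# Made-up sample inputs standing in for s.get(lnk).text and the reviews JSON
sample_html = '<script>{"sku":"W925887A","name":"jeans"}</script>'
with_ratings = {'averageRating': 4.5, 'totalAvailableRatings': 12}
without_ratings = {'message': 'not found'}

rows = []
for payload in (with_ratings, without_ratings):
    rating, num_reviews = parse_review_info(payload)
    rows.append({'SKU': extract_sku(sample_html),
                 'Rating': rating,
                 'Num_Reviews': num_reviews})

print(rows)
```

Inside the loop, the two values returned by parse_review_info(review_info) can go straight into the "Rating" and "Num_Reviews" keys of the row dict from the question before it is appended to all_data. Returning None instead of raising when the SKU or the ratings are missing keeps one bad page from stopping the whole run.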
 
QHarr