I'm using BeautifulSoup to grab all of the product URLs from a website that has 58 pages. I'm using a for loop with an f-string because each page has exactly the same link except for the page number. However, every iteration of the loop fetches page 1, so my program retrieves the same 36 links from page 1, 58 times (the list of URLs repeats every 36 entries).

My hypothesis is that, because of the way the web pages are built, the f-string URL only ever takes me to page 1. For example, when you load the URL for page 8, the site briefly displays page 1, pauses while it loads, and only then displays page 8.

I also noticed that the pages aren't "discrete", meaning that if you're on a specific page and you scroll down (after 36 products), then it would automatically take you to the next page (after briefly loading) and the URL would change accordingly, and likewise, if you scroll up, then it would take you to the previous page. Other websites (like Amazon.com) aren't like this: you'd have to physically click on the "next page" button (or another button) in order to view more results.

Is there a method to get around this issue?

import requests
from bs4 import BeautifulSoup

# headers: dict of request headers, defined elsewhere (not shown in the question)
productlinks = []
for x in range(1, 59):
    r = requests.get(f'https://www.yesstyle.com/en/beauty-face-cleansers/list.html/bcc.15545_bpt.46#/sb=136&bcc=15545&l=1&bt=37&pn={x}&s=10&bpt=46',
                    headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('div', class_='itemContainer')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
  • Have you tried using selenium with your BeautifulSoup? – 0m3r May 25 '22 at 07:28
  • This is my very first time webscraping so I've never used selenium before. Would selenium be better to use in this scenario? – itsanhtuanho May 25 '22 at 07:32
  • Use both, load the page with selenium then scrape on BeautifulSoup - https://stackoverflow.com/a/61313886/4539709 – 0m3r May 25 '22 at 07:45

1 Answer

The data for each page is loaded by a separate request, which is why you see the page change after loading. That request returns all of the data as JSON. As mentioned, the requests library won't run the JavaScript that performs this step, so you need to do it yourself.
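
One detail worth checking in the question's URL: the `pn={x}` part comes after `#`, which makes it a URL fragment. Fragments are handled inside the browser and are not sent to the server, which would explain why every request comes back as page 1. A minimal sketch using only the standard library (the URL is the one from the question, with 8 filled in as the page number):

```python
from urllib.parse import urlsplit

# URL from the question, with the page number set to 8
url = ('https://www.yesstyle.com/en/beauty-face-cleansers/list.html/'
       'bcc.15545_bpt.46#/sb=136&bcc=15545&l=1&bt=37&pn=8&s=10&bpt=46')

parts = urlsplit(url)

print(parts.path)         # path component: identical for every page number
print(repr(parts.query))  # empty string: the URL has no '?' query string
print(parts.fragment)     # contains pn=8, but a fragment is never sent to the server
```

Since the path is the same for every value of `pn`, the server has no way to tell the pages apart from this URL alone.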

You can make the API call yourself to get the information you want (I suggest you print(data) to see exactly what is available to you to use).
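
As a rough illustration, the query string for that API call can be assembled from a dict so that the page number is the only value that changes. The endpoint and parameter values are copied from the script further down; `build_page_url` is a hypothetical helper name, and apart from `pn` (the page number) the meaning of the parameters is not known here:

```python
from urllib.parse import urlencode

# Endpoint copied from the API call used in the full script
API_URL = 'https://www.yesstyle.com/rest/products/v1/department/15545'


def build_page_url(page):
    """Return the API URL for one results page (hypothetical helper)."""
    # Values copied as-is from the observed request; only 'pn' varies
    params = {
        'bcc': 15545,
        'bpt': 46,
        'bt': 37,
        'l': 1,
        'pn': page,
        's': 10,
        'sb': 136,
        'yet': 1,
    }
    return f'{API_URL}?{urlencode(params)}'


print(build_page_url(2))
```

Note that here the parameters follow a `?`, so they form a real query string and do reach the server.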

You also need the security headers for this to work. They can be found when accessing the site for the first time and must then be included with each request:

import requests
from bs4 import BeautifulSoup
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    'Referer' : 'https://www.yesstyle.com/en/beauty-face-cleansers/list.html/bcc.15545_bpt.46',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
}

productlinks = []

with requests.Session() as session:
    # Load the home page first to pick up the security headers
    req_root = session.get('https://www.yesstyle.com/', headers=headers)
    soup_root = BeautifulSoup(req_root.content, 'lxml')

    # The security values sit in an inline script that assigns a JSON object
    for script in soup_root.find_all('script'):
        if script.string and 'authUrl' in script.string:
            data = json.loads(script.string.split(' = ', 1)[1])
            headers |= data['security']  # in-place dict merge (Python 3.9+)
            break

    # Request the JSON for each page directly (only pages 1 and 2 here)
    for x in range(1, 3):
        print(f"Page {x}")

        r = session.get(f'https://www.yesstyle.com/rest/products/v1/department/15545?bcc=15545&bpt=46&bt=37&l=1&pn={x}&s=10&sb=136&yet=1',
                        headers=headers)

        data = r.json()

        for product in data['products']:
            p = product['product']
            print(f"  {p['brandName']} - {p['name']}")

This would give you output such as:

Page 1
  COSRX - Low pH Good Morning Gel Cleanser
  heimish - All Clean Balm 120ml
  THE FACE SHOP - Rice Water Bright Light Cleansing Oil 150ml
  iUNIK - Calendula Complete Cleansing Oil 200ml
  ISEHAN - Kiss Me Heroine Make Speedy Mascara Remover
  COSRX - Salicylic Acid Daily Gentle Cleanser
  B.LAB - Matcha Hydrating Foam Cleanser
  SOME BY MI - Pure Vitamin C V10 Cleansing Bar 1pc
  Beauty of Joseon - Radiance Cleansing Balm NEW
  Pyunkang Yul - Low pH Pore Deep Cleansing Foam
  Kose - Softymo Cleansing Oil 230ml - 3 Types
  Rohto Mentholatum - Hada Labo Gokujyun Oil Cleansing
  Shiseido - Senka Perfect Whip Face Wash
  SOME BY MI - Bye Bye Blackhead 30 Days Miracle Green Tea Tox Bubble Cleanser
  ROVECTIN - Skin Essentials Conditioning Cleanser
  THE FACE SHOP - Rice Water Bright Cleansing Foam 150ml
  Dear, Klairs - Gentle Black Deep Cleansing Oil
  Rohto Mentholatum - Hada Labo Gokujyun Hyaluronic Acid Face Foam
  ETUDE - Soon Jung pH 6.5 Whip Cleanser
  SIORIS - Cleanse Me Softly Milk Cleanser
  SOME BY MI - AHA, BHA, PHA 30 Days Miracle Cleansing Bar 1pc
  Dear, Klairs - Rich Moist Foaming Cleanser (Renewal)
  Rohto Mentholatum - Hada Labo Gokujyun Oil Cleansing Refill
  iUNIK - Centella Bubble Cleansing Foam 150ml
  ETUDE - AC Clean Up Pink Powder Spot 15ml
  RiRe - All Kill Blackhead Remover Stick
  KUMANO COSME - Pharmaact Deep Cleansing Oil
  NEOGEN - Dermalogy Real Fresh Foam Green Tea
  Beauty of Joseon - Green Plum Refreshing Cleanser
  SOME BY MI - AHA,BHA,PHA 30 Days Miracle Acne Clear Foam
  Haruharu WONDER - Black Rice Moisture 5.5 Soft Cleansing Gel
  heimish - All Clean Green Foam 150ml
  innisfree - Jeju Volcanic Blackhead 3-Step Program
  innisfree - Jeju Volcanic Pore Cleansing Foam
  PURITO - From Green Cleansing Oil - 3 Types
  By Wishtrend - Green Tea & Enzyme Powder Wash JUMBO
Page 2
  Haruharu WONDER - Black Rice Moisture Deep Cleansing Oil
  ETUDE - Soon Jung pH 5.5 Foam Cleanser
  B.LAB - PHA Perfect Pore Cleansing Oil JUMBO
  Rohto Mentholatum - Hada Labo Gokujyun Hyaluronic Acid Face Wash
  MIZON - Snail Repairing Foam Cleanser
  innisfree - Green Tea Foam Cleanser
  Dear, Klairs - Pore Gentle Black Sugar Charcoal Soap
  SIORIS - Day By Day Cleansing Gel
  Pyunkang Yul - Deep Clear Cleansing Balm
  Dear, Klairs - Gentle Black Fresh Cleansing Oil
  COSRX - Low pH Good Morning Gel Cleanser Mini
  COSRX - Advanced Snail Mucin Gel Cleanser
  THE FACE SHOP - Rice Water Bright Lip & Eye Makeup Remover 120ml
  Beauty of Joseon - Rice Duo YesStyle Exclusive Kit
  Shiseido - Senka Perfect Whip Face Wash Collagen In
  make p:rem - Safe Me. Relief Moisture Cleansing Foam 150ml
  I'm from - Fig Cleansing Balm
  KLAVUU - Pure Pearlsation Revitalizing Facial Cleansing Foam 130ml
  PURITO - Defence Barrier pH Cleanser
  SIORIS - Fresh Moment Cleansing Oil
  MIZON - Cicaluronic Cleansing Balm
  RiRe - Style Black Head Brush Cleanser 20ml
  THE FACE SHOP - Rice Water Bright Rich Facial Cleansing Oil 150ml
  Pyunkang Yul - Calming Low pH Foaming Cleanser
  SKINFOOD - Egg White Perfect Pore Cleansing Foam 150ml
  Pyunkang Yul - Deep Cleansing Oil
  heimish - All Clean Balm Mini
  COSRX - Favorites Best Sellers Set
  THE FACE SHOP - Rice Water Bright Rice Bran Facial Foaming Cleanser
  Kose - Softymo Cleansing Oil Refill 200ml - 3 Types
  Cure - Natural Aqua Gel
  heimish - All Clean Mini Kit
  THE FACE SHOP - Rice Water Bright Cleansing Foam 300ml
  COSRX - Advanced Snail Mucin Gel Cleanser Mini
  ETUDE - Soon Jung Lip And Eye Remover
  iUNIK - Centella Mild Cleansing Foam

URLs are also available in the data if needed.
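
The exact key that holds each product URL is not shown above, so one way to find it is to walk a product entry and list every string value that looks like a URL or site path. This is only a sketch: `find_url_like_fields` is a hypothetical helper, and the `link` and `images` keys in the sample are made up for illustration (only `product`, `brandName` and `name` appear in the real output above).

```python
def find_url_like_fields(obj, path=''):
    """Yield (path, value) for string values that look like URLs or site paths."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_url_like_fields(value, f'{path}.{key}' if path else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_url_like_fields(value, f'{path}[{index}]')
    elif isinstance(obj, str) and obj.startswith(('http://', 'https://', '/')):
        yield path, obj


# Made-up sample entry; 'link' and 'images' are hypothetical keys
sample = {
    'product': {
        'brandName': 'COSRX',
        'name': 'Low pH Good Morning Gel Cleanser',
        'link': '/en/example-product/info.html',
        'images': ['https://example.com/a.jpg'],
    }
}

for field_path, value in find_url_like_fields(sample):
    print(field_path, '->', value)
```

With the real response, you could pass `data['products'][0]` instead of `sample` and then read the matching key inside the page loop.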

Martin Evans
  • In the second for loop, where/how did you get the URL? r = session.get(f'https://www.yesstyle.com/rest/products/v1/department/15545?bcc=15545&bpt=46&bt=37&l=1&pn={x}&s=10&sb=136&yet=1', headers=headers) – itsanhtuanho Jun 01 '22 at 01:43
  • I used the network tools on my browser to see the request it was making and then duplicated that request using Python – Martin Evans Jun 01 '22 at 09:24