I tried to get all the products from this website, but I don't think I chose the best method, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on this kind of problem.
The way I'm doing it now is like this:
- go to the index page of the website
- get all the categories from there (A-Z 0-9)
- access each of the above categories and recursively go through all of their subcategories until I reach the products page
- when I reach the products page, check whether the product has more SKUs. If it does, get the links. Otherwise, that's the only SKU.
Now, the code below works, but it just doesn't get all the products, and I can't see any reason why it would skip some. Maybe my whole approach is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    # Fetch a page, retrying after a random pause if the request fails.
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    # The index is split into A-Z sections plus '0-9'; Q and Y don't exist.
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            # Request 200 products per page to keep pagination short.
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    # Recurse into the subcategories shown in the "CATEGORIES" carousel.
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    # Yield the product links on the current page.
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    # Follow the last pagination link (the "next" arrow) unless it's a dead '#'.
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    # A product page with multiple SKUs lists them in a table;
    # collect the per-SKU links, or return False if there are none.
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()
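To make "doesn't get all the products" more concrete, I'm thinking of adding a sanity check along these lines: collect the unique product links per category first, then compare each count against the total that the category page itself displays. A rough sketch of what I mean (the audit function and its names are mine, just for illustration):

from collections import Counter

def audit():
    # Rough sketch: count unique product links per category so the
    # totals can be compared against what each category page claims.
    seen = set()
    counts = Counter()
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            if product_link not in seen:
                seen.add(product_link)
                counts[category_link] += 1
    for category_link, n in counts.most_common():
        print(n, category_link)
    print('Total unique product links:', len(seen))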
Now, the question might be too generic, but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone walk me through the process of working out the best way to approach a scenario like this?
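For example, one alternative I've been wondering about is whether pulling product URLs from an XML sitemap (if the site publishes one) would be more reliable than crawling the category tree. A rough, untested sketch of what I mean, assuming a standard /sitemap.xml location (which may not exist here):

from xml.etree import ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def links_from_sitemap(url='https://www.richelieu.com/sitemap.xml'):
    # Untested sketch: yield every <loc> URL from a sitemap. If the root
    # sitemap is an index, the <loc> entries point at sub-sitemaps and
    # would need the same treatment recursively.
    tree = ET.fromstring(session_.get(url).content)
    for loc in tree.findall('.//sm:loc', SITEMAP_NS):
        yield loc.text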