I tried to get all the products from this website, but I don't think I chose the best method, because some of them are missing and I can't figure out why. It's not the first time I've gotten stuck on this kind of problem.
The way I'm doing it now is like this:
- go to the index page of the website
- get all the categories from there (A-Z 0-9)
- access each of the above categories and recursively go through all of their subcategories until I reach the products page
- when I reach the products page, check whether the product has more SKUs. If it does, get the links. Otherwise, that's the only SKU.
Now, the code below works, but it just doesn't get all the products, and I can't see any reason why it would skip some. Maybe my whole approach is wrong.
from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session

INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    # Fetch a page, retrying after a random pause if the request fails.
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    # The index is split into A-Z sections plus '0-9'; Q and Y don't exist.
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()
    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            # Request 200 products per page to keep pagination short.
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    # Recurse into the subcategories shown in the "CATEGORIES" carousel.
    for link in page.xpath(
        '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    # Yield the product links on the current page.
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    # Follow the last pagination link (the "next" arrow) unless it's a dead '#'.
    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    # A product page with multiple SKUs lists them in a table;
    # collect the per-SKU links, or return False if there are none.
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()
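To make "doesn't get all the products" more concrete, I'm thinking of adding a sanity check along these lines: collect the unique product links per category first, then compare each count against the total that the category page itself displays. A rough sketch of what I mean (the audit function and its names are mine, just for illustration):

from collections import Counter

def audit():
    # Rough sketch: count unique product links per category so the
    # totals can be compared against what each category page claims.
    seen = set()
    counts = Counter()
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            if product_link not in seen:
                seen.add(product_link)
                counts[category_link] += 1
    for category_link, n in counts.most_common():
        print(n, category_link)
    print('Total unique product links:', len(seen))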
Now, the question might be too generic, but I wonder if there's a rule of thumb to follow when someone wants to get all the data (products, in this case) from a website. Could someone walk me through the process of working out the best way to approach a scenario like this?
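For example, one alternative I've been wondering about is whether pulling product URLs from an XML sitemap (if the site publishes one) would be more reliable than crawling the category tree. A rough, untested sketch of what I mean, assuming a standard /sitemap.xml location (which may not exist here):

from xml.etree import ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def links_from_sitemap(url='https://www.richelieu.com/sitemap.xml'):
    # Untested sketch: yield every <loc> URL from a sitemap. If the root
    # sitemap is an index, the <loc> entries point at sub-sitemaps and
    # would need the same treatment recursively.
    tree = ET.fromstring(session_.get(url).content)
    for loc in tree.findall('.//sm:loc', SITEMAP_NS):
        yield loc.text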