I'm trying to scrape product names from a website. Oddly, I only seem to get a random 12 items. I've tried both HtmlAgilityPack and HttpClient, and I get the same random results. Here's my code for HtmlAgilityPack:

using System.Linq;
using System.Net;
using System.Net.Http;
using HtmlAgilityPack;

var url = @"http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
HtmlWeb web = new HtmlWeb();
// proxy, PROXY_UID, PROXY_PWD and PROXY_DMN are defined elsewhere (not shown)
var doc = web.Load(url, "GET", proxy, new NetworkCredential(PROXY_UID, PROXY_PWD, PROXY_DMN));
// Collect the trimmed text of every <div class="product-name">
var nodes = doc.DocumentNode.Descendants("div")
            .Where(div => div.GetAttributeValue("class", string.Empty) == "product-name")
            .Select(div => div.InnerText.Trim());

[UPDATE 1] @CodingKuma suggested I try Selenium WebDriver. Here's my code using Selenium WebDriver:

IWebDriver chromeDriver = new ChromeDriver(@"C:\TEMP\Projects\Chrome\chromedriver_win32");
chromeDriver.Url = "http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
var items = chromeDriver.FindElements(By.ClassName("product-name"));
items.Count().Dump();
chromeDriver.Quit();

I tried this code but still no luck. There are over 20 items on that page, but I seem to only get a random 12. How can I scrape all items on that site?

inquisitive_one

4 Answers

Since v1.5.0-beta92, HtmlAgilityPack has a LoadFromBrowser method that lets you wait until all the elements you want are ready.

Documentation: http://html-agility-pack.net/from-browser

string url = "http://html-agility-pack.net/from-browser";

var web1 = new HtmlWeb();
var doc1 = web1.LoadFromBrowser(url, o =>
{
    var webBrowser = (WebBrowser) o;

    // WAIT until the dynamic text is set
    return !string.IsNullOrEmpty(webBrowser.Document.GetElementById("uiDynamicText").InnerText);
});
var t1 = doc1.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText;

var web2 = new HtmlWeb();
var doc2 = web2.LoadFromBrowser(url, html =>
{
    // WAIT until the dynamic text is set
    return !html.Contains("<div id=\"uiDynamicText\"></div>");
});
var t2 = doc2.DocumentNode.SelectSingleNode("//div[@id='uiDynamicText']").InnerText;

Console.WriteLine("Text 1: " + t1);
Console.WriteLine("Text 2: " + t2);

The trick here is to find something that tells you when the page is ready since it's impossible for the library to know.
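
The "find something that tells you when the page is ready" idea is not tied to any one library. As a rough, library-neutral sketch in Java (the language of another answer in this thread), the check can be passed in as a predicate and polled until it succeeds or a timeout expires. `ReadyPoller` and `waitUntil` are hypothetical names made up for this illustration, not part of HtmlAgilityPack or Selenium.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not part of HtmlAgilityPack or Selenium.
final class ReadyPoller {
    private ReadyPoller() {}

    // Polls isReady until it returns true (result: true) or until the timeout
    // has elapsed or the thread is interrupted (result: false).
    static boolean waitUntil(BooleanSupplier isReady, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (isReady.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The predicate is where the site-specific knowledge goes, for example "the placeholder element now has text" or "the product count stopped growing".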

Jonathan Magnan
  • Did you try this on the site OP posted? I don't think this will work because it's using a lazy loader. The page is done loading, you have to scroll down to the bottom and THEN wait for the page to finish loading... see my answer for more details. – JeffC Aug 04 '17 at 18:55
  • @JeffC, no I didn't try. However the same result can be achieved since he has access to the WebBrowser and can use some API like webBrowser.Document.Window.ScrollTo(0, webBrowser.Document.Body.ScrollRectangle.Height); – Jonathan Magnan Aug 04 '17 at 20:19

So there are a couple issues that prevent the count from being correct.

  1. The page has a lazy loader. You have to scroll down to trigger the load of the items over 12.

  2. The page uses AJAX calls to load the items over 12.

So, you need to navigate to the page, scroll to the bottom of the page, wait for AJAX to complete, and then scrape the page. The code below is tested and returns 20 items.

The script

String url = "http://www.roots.com/ca/en/men/tops/shirts-and-polos/";
driver.navigate().to(url);
JavascriptExecutor js = ((JavascriptExecutor) driver);
// Scroll to the bottom repeatedly until the page height stops changing
int height = 1;
int lastHeight = 0;
while (lastHeight != height)
{
    lastHeight = height;
    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
    height = (int) (long) js.executeScript("return document.body.scrollHeight;");
}

waitForJSandJQueryToLoad(10);

List<WebElement> products = driver.findElements(By.cssSelector("div.product-name"));
System.out.println(products.size());
for (WebElement e : products)
{
    System.out.println(e.getText());
}

Support function

public boolean waitForJSandJQueryToLoad(int timeOut)
{
    WebDriverWait wait = new WebDriverWait(driver, timeOut);

    ExpectedCondition<Boolean> jQueryIsLoaded = new ExpectedCondition<Boolean>()
    {
        @Override
        public Boolean apply(WebDriver driver)
        {
            return (Boolean) ((JavascriptExecutor) driver).executeScript("return (window.jQuery != null) && (jQuery.active === 0);");
        }
    };

    ExpectedCondition<Boolean> jsIsLoaded = new ExpectedCondition<Boolean>()
    {
        @Override
        public Boolean apply(WebDriver driver)
        {
            return (Boolean) ((JavascriptExecutor) driver).executeScript("return document.readyState == 'complete'");
        }
    };

    return wait.until(jQueryIsLoaded) && wait.until(jsIsLoaded);
}
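
One caveat about the scroll loop in the script: it reads the height right after each scroll, so on a page that loads several batches it may stop before the next batch has been added. A possible variation, sketched here without any Selenium dependency, pauses between rounds and only stops once the height has stayed the same for a few consecutive rounds. `ScrollUntilStable`, `run`, and the parameter names are hypothetical; the `scrollAndMeasure` callback is assumed to wrap the two `js.executeScript` calls from the script.

```java
import java.util.function.IntSupplier;

// Hypothetical helper for illustration only; the caller supplies the scrolling.
final class ScrollUntilStable {
    private ScrollUntilStable() {}

    // Invokes scrollAndMeasure (assumed to scroll to the bottom and return the
    // page height) until the height is unchanged for stableRounds consecutive
    // rounds, or maxRounds rounds have run. Returns the last height seen.
    static int run(IntSupplier scrollAndMeasure, int stableRounds, int maxRounds, long pauseMillis) {
        int lastHeight = -1;
        int unchanged = 0;
        for (int round = 0; round < maxRounds && unchanged < stableRounds; round++) {
            int height = scrollAndMeasure.getAsInt();
            if (height == lastHeight) {
                unchanged++;
            } else {
                unchanged = 0;
                lastHeight = height;
            }
            try {
                // Give lazily loaded content time to arrive before the next round.
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return lastHeight;
    }
}
```

The values for `stableRounds` and `pauseMillis` are a trade-off between speed and the risk of stopping early; suitable values depend on the site and are not something this thread establishes.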

Output

20
Rideau Flannel Shirt
Westridge Denim Shirt
Rideau Flannel Shirt
Riverside Plaid Shirt
Riverside Plaid Shirt
Heritage Peppered Polo
Heritage Peppered Polo
Heritage Peppered Polo
Cedar Jersey Polo
Cedar Jersey Polo
Hope River Shirt
Hawthorne Surplus Shacket
Acadian Linen Shirt
Camp Short Sleeve Shirt
Foxley Short Sleeve Shirt
Heritage Peppered Polo
Foxley Short Sleeve Shirt
Waterway Indigo Shirt
Waterway Indigo Shirt
Resolute Flannel Shirt
JeffC

For most single-page apps, or pages that load content dynamically, you're better off using an actual browser to navigate the pages. I'd suggest looking into Selenium for this type of setup.

https://www.nuget.org/packages/Selenium.WebDriver

CodingKuma
  • That doesn't work either. Here's my code: `IWebDriver chromeDriver = new ChromeDriver(@"C:\TEMP\Projects\Chrome\chromedriver_win32"); chromeDriver.Url = "http://www.roots.com/ca/en/men/tops/shirts-and-polos/"; var items = chromeDriver.FindElements(By.ClassName("product-name")); items.Count().Dump(); chromeDriver.Quit();` I still get a count 12 instead of 24. – inquisitive_one Jul 28 '17 at 19:06
  • I think most people would agree that it's a bad practice to take other answers and add them to your own without any significant contribution. – JeffC Aug 04 '17 at 14:17
  • @JeffC sorry, I adjusted to remove the reference to the size parameter from the other answer. As for the scrolling part I was merely answering his comment about why he wasn't getting all of them. I didn't get that from your answer. No different than you suggesting selenium after I did.. – CodingKuma Aug 04 '17 at 18:39
  • @CodingKuma It's very different. I didn't just say, "Use Selenium" I had a description of the problem and then provided solutions including code. Your answer was from a week and a half ago and you edited your answer recently and conveniently included comments from two other answers. – JeffC Aug 04 '17 at 18:51
  • @JeffC fine I removed my update, even though I didn't even read your answer before replying and adding it. – CodingKuma Aug 04 '17 at 19:04

As others said, this page loads its content dynamically using JavaScript, so Html Agility Pack only gets the first items.

Web scraping can be tough, especially with modern sites that rely more and more on JavaScript, and it is generally very specific to the target site (not to mention the legal issues). You can use various techniques to work out how to get the information you need.

In this case, if you use any network analyzer, you'll quickly see the site uses an 'sz' (for Size I guess) query string parameter that allows you to specify the number of items you want.

So, just change your URL to this:

var url = @"http://www.roots.com/ca/en/men/tops/shirts-and-polos/?sz=9999";

and get any number of items you want.
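
If the URL is built in code rather than typed by hand, the parameter can be set programmatically. Here is a small Java sketch (Java being the language of another answer in this thread) that adds or replaces `sz` on an absolute http/https URL. `PageSizeUrl` and `withPageSize` are hypothetical names for this illustration, and the only thing assumed about the site is what is described above: that it honours an `sz` query string parameter.

```java
import java.net.URI;

// Hypothetical helper for illustration only.
final class PageSizeUrl {
    private PageSizeUrl() {}

    // Returns the given absolute URL with its sz query parameter set to size,
    // replacing any existing sz value and keeping all other parameters.
    static String withPageSize(String url, int size) {
        URI uri = URI.create(url);
        StringBuilder query = new StringBuilder();
        String rawQuery = uri.getRawQuery();
        if (rawQuery != null) {
            for (String pair : rawQuery.split("&")) {
                // Skip empty fragments and any previous sz setting.
                if (pair.isEmpty() || pair.equals("sz") || pair.startsWith("sz=")) {
                    continue;
                }
                query.append(pair).append('&');
            }
        }
        query.append("sz=").append(size);

        StringBuilder out = new StringBuilder()
                .append(uri.getScheme()).append("://")
                .append(uri.getRawAuthority())
                .append(uri.getRawPath())
                .append('?').append(query);
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }
}
```

Calling `PageSizeUrl.withPageSize("http://www.roots.com/ca/en/men/tops/shirts-and-polos/", 9999)` yields the URL shown above.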

Simon Mourier
  • While this is useful info, it doesn't answer the question. He's already getting 20 products and only seeing the first 12. Getting 9999 products isn't going to solve that issue. – JeffC Aug 04 '17 at 18:54
  • @JeffC - ??? Without the sz parameter, you don't get all products in one HTTP GET, only a portion; that's precisely the question. Defining sz with a big value will get the maximum possible number of items in one GET (up to 9999 in my sample), i.e. 20 for this query. Try both URLs with Fiddler and you will understand. – Simon Mourier Aug 05 '17 at 06:23
  • No, the question is, "hey... there are 20 products on the page and I'm only getting 12, why is that?" If OP uses your answer, the next question will be, "hey... there are 9999 products on the page and I'm only getting 12, why is that?" Ref: `There are over 20 items on that page, but I seem to only get a random 12.` – JeffC Aug 05 '17 at 12:59