
I'm trying to extract the prices from the website referenced in the code below. I'm using AngleSharp for the extraction. On the website, the prices are marked up like this (as an example):

<span class="c-price">650.00                            </span>

I'm using the following code for the extraction.

using AngleSharp.Parser.Html;
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;

//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();

//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();

//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);

//Do something with LINQ
var pricesListItemsLinq = document.All
     .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());

However, I'm not getting any items, even though they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use? And what code should I use?

Lucas Trzesniewski
inquisitive_one
  • You may want to try `document.QuerySelectorAll("span.c-price")` instead. – Lucas Trzesniewski Sep 06 '15 at 19:57
  • The elements you're trying to query for are added to the page dynamically. You'll need to execute the javascript on the page. I don't know if AngleSharp can do that. – Jeff Mercado Sep 06 '15 at 20:19
  • @LucasTrzesniewski I tried your suggestion and I still don't get anything. – inquisitive_one Sep 07 '15 at 00:13
  • @JeffMercado AngleSharp seems to have a JS library. I added the library and used the following: `var config = Configuration.Default.WithJavaScript(); var parser = new HtmlParser(config);`. I still don't have any luck. Any suggestions on alternatives? – inquisitive_one Sep 07 '15 at 00:20
  • Honestly I don't know of any .NET library designed just to parse and execute the scripts natively just for the sake of querying it. I think your only option here is to load it up in a browser and scrape from that. You could probably use something like Selenium for that. – Jeff Mercado Sep 07 '15 at 00:23

1 Answer


I am late to the party, but I will try to bring some sanity here.

Querying static webpages

For this we require the following set of tools / functionality:

  • HTTP requester (to obtain resources, e.g., HTML documents, via HTTP), potentially with an SSL/TLS layer on top (either accepting all certificates or working against the certificate store / known CAs)
  • HTML parser
  • A queryable object model representation of the parsed HTML document
  • Maybe additionally some cookie state and the ability to follow links / post forms

AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).

We would start by creating an AngleSharp configuration that defines which capabilities are available to the browsing engine. This engine is exposed in the form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (either from a local source, a constructed source, or a remote source).

var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");

Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we look for.

AngleSharp embraces LINQ (or IEnumerable in general); however, it makes sense to let the selector query do as much of the filtering as possible.

So instead of

var pricesListItemsLinq = document.All
    .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));

We write

var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");

This is also much more robust. The ClassList is a complex object giving access to the list of classes, so you probably meant either ClassList.Contains or ClassName.Equals (ClassName being the string representation). Note: The two versions are not equivalent, because the former looks for a class within the list of classes, while the latter looks for a match of the whole class serialization (thus imposing an extra boundary condition on the match: it needs to be the only class).
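To make the difference concrete, here is a minimal sketch that runs the three variants against a constructed document. The markup in SampleHtml and the names ClassMatchingDemo and CountMatchesAsync are made up for illustration; they are not taken from the shop page or from AngleSharp.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

public static class ClassMatchingDemo
{
    // Hypothetical markup (not taken from the shop page): the second price
    // carries an additional class, the third span is not a price at all.
    public const string SampleHtml =
        "<span class=\"c-price\">650.00   </span>" +
        "<span class=\"c-price sale\">499.50</span>" +
        "<span class=\"title\">Some product</span>";

    public static async Task<(int BySelector, int ByContains, int ByClassName)> CountMatchesAsync()
    {
        // No loader is configured because the document comes from a constructed source.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(SampleHtml));
        var spans = document.All.Where(m => m.LocalName == "span").ToList();

        return (
            // CSS selector: matches every span that has the class c-price.
            document.QuerySelectorAll("span.c-price").Count(),
            // ClassList.Contains: looks for the class within the list of classes.
            spans.Count(m => m.ClassList.Contains("c-price")),
            // ClassName: compares the whole class attribute, so c-price must be the only class.
            spans.Count(m => m.ClassName == "c-price"));
    }

    public static async Task Main()
    {
        var counts = await CountMatchesAsync();
        Console.WriteLine($"selector:           {counts.BySelector}");
        Console.WriteLine($"ClassList.Contains: {counts.ByContains}");
        Console.WriteLine($"ClassName equals:   {counts.ByClassName}");
    }
}
```

With this sample markup the selector and ClassList.Contains should both find the two price spans, while the ClassName comparison should only find the first one.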

Dealing with dynamic pages

This is far more complicated. The basics are the same as previously, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need

  • A JavaScript engine
  • A valid CSSOM
  • A fake (or even fully computed) rendering tree
  • Many more of the DOM interfaces found in real browsers (e.g., navigator, full history, web workers, ...); the list is practically endless here

While there is a project that delivers an experimental (and limited) C#-only JS engine to AngleSharp, the last two requirements cannot be fully met right now. Furthermore, the CSSOM may not be complete enough for some web applications. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).

Long story short:

var config = Configuration.Default
    .WithDefaultLoader()
    .WithCss()
    .WithJavaScript(); // maybe even more
var context = BrowsingContext.New(config);

The Task behind the await when opening a new document is equivalent to a load event in the DOM. Thus it does not complete as soon as the document has been downloaded and parsed, but only once all scripts have been loaded (and potentially run), including any resources that needed to be downloaded.
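In practice this means the query can follow the await directly. The following is only a sketch: PriceReader and ReadPricesAsync are hypothetical names, and the context passed in is assumed to have been created from a configuration like the one above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class PriceReader
{
    // Hypothetical helper: opens the address in the given context and
    // returns the trimmed text of every span.c-price element.
    public static async Task<IReadOnlyList<string>> ReadPricesAsync(
        IBrowsingContext context, string address)
    {
        // The await completes once the document has been loaded,
        // so the query below can run right away.
        var document = await context.OpenAsync(address);

        return document.QuerySelectorAll("span.c-price")
            .Select(element => element.TextContent.Trim())
            .ToList();
    }
}
```

If the returned list is empty even though a real browser shows prices, the page most likely depends on something the scripting engine could not provide.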

Hope this helps a bit!

Florian Rappl
  • Thank you for this helpful post! I installed AngleSharp via nuget package manager browser for a .net core 2 project and needed to also install AngleSharp.Scripting.JavaScript package in order to specify WithJavaScript() – kyle Jan 01 '18 at 21:22
  • @Florian Rappl how can I make a POST request using context? – anatol Apr 04 '18 at 13:47
  • With context (I think you refer to BrowsingContext) you can just navigate to pages. Such navigations are always GET requests. Form submissions (i.e., POST) are done via forms. The respective loaders (or the low level requesters) can, of course, invoke any kind of requests - including POST requests. – Florian Rappl Apr 04 '18 at 18:20