
As a continuation of this post, I am trying to parse some data out of an HTML page. Here is the HTML (there is more on the page, but this is the important section):

<table class="integrationteamstats">
<tbody>
<tr>
    <td class="right">
        <span class="mediumtextBlack">Queue:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Aban:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0%</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Staffed:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
</tr>
<tr>
    <td class="right">
        <span class="mediumtextBlack">Wait:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0:00</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">Total:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
    <td class="right">
        <span class="mediumtextBlack">On ACD:</span>
    </td>
    <td class="left">
        <span class="mediumtextBlack">0</span>
    </td>
</tr>
</tbody>
</table>

I need to get two pieces of information: the data inside the td that follows the Queue label and the data inside the td that follows the Wait label (that is, the queue count and the wait time). Obviously the numbers are going to update frequently.

I have gotten to the point where the HTML is loaded into an HtmlDocument variable, and I've found something along the lines of using an HtmlNodeCollection to gather nodes that meet certain criteria. This is basically where I am stuck:

HtmlNodeCollection tds = this.html.DocumentNode.SelectNodes("//td");

foreach (HtmlNode td in tds)
{
    /* I want to write:
     * If the last node's value was 'Queue', give me the value of this node.
     * and
     * If the last node's value was 'Wait Time', give me the value of this node.
     */
}

And I can go through this with a foreach, but I am not certain how to access the value or how to get the next value.
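
For what it's worth, the rough shape I have in mind (and I am not sure this is the right approach) is to remember the previous cell's text as I loop:

// Rough idea only: track the previous cell's text while looping.
string queueValue = null, waitValue = null, previousText = null;
foreach (HtmlNode td in tds)
{
    string text = td.InnerText.Trim();
    if (previousText == "Queue:")
        queueValue = text;   // the cell right after the "Queue:" label
    else if (previousText == "Wait:")
        waitValue = text;    // the cell right after the "Wait:" label
    previousText = text;
}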

Sugitime
2 Answers


Generally, there's no need to go through this with a foreach, as getting the targeted information directly is pretty easy (with a foreach you'd have to manage state across iterations of the loop, and it gets really unwieldy).

First, you want to get the table. Filtering on the class attribute is generally a bad idea, as you can have multiple elements in an HTML document with the same class applied to them. If you had an id attribute, that would be ideal.

That said, if this is the only table with this class, then you can get the body of the table element using:

// Get the table.
HtmlNode tableBody = document.DocumentNode.SelectSingleNode(
    "//table[@class='integrationteamstats']/tbody");

From there, you want to get the individual rows. They are direct children of the tbody element, but keep in mind that Html Agility Pack keeps the whitespace between tags as text nodes in ChildNodes, so indexing ChildNodes directly would hit those text nodes first. Selecting the tr elements and taking them by position avoids that:

HtmlNodeCollection rows = tableBody.SelectNodes("tr");
HtmlNode queueRow = rows[0];
HtmlNode waitRow = rows[1];

Then you want the second td element in each row. While there is a span tag in there that wraps the content, you want all of the text in the td in its entirety, so you can use the InnerText property to get the value (again selecting the td elements so the whitespace text nodes don't throw the index off):

string queueValue = queueRow.SelectNodes("td")[1].InnerText.Trim();
string waitValue = waitRow.SelectNodes("td")[1].InnerText.Trim();

Note that there is some repetition here, so if you find there are a lot of rows that you have to parse like this, you might want to factor some of the logic out into helper methods.
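
For example, a small helper along these lines (an untested sketch; the method name GetValueAfterLabel is just illustrative) keeps the lookup logic in one place:

// Illustrative helper: returns the text of the cell immediately to the
// right of the cell whose span contains the given label (e.g. "Queue:").
static string GetValueAfterLabel(HtmlNode tableBody, string label)
{
    HtmlNode labelCell = tableBody.SelectSingleNode(
        ".//td[span[contains(., '" + label + "')]]");
    if (labelCell == null)
        return null;

    // The value lives in the next td sibling.
    HtmlNode valueCell = labelCell.SelectSingleNode("following-sibling::td[1]");
    return valueCell == null ? null : valueCell.InnerText.Trim();
}

With that, the queue value is GetValueAfterLabel(tableBody, "Queue:") and the wait value is GetValueAfterLabel(tableBody, "Wait:").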

casperOne
  • Excellent! Thank you casperOne! I think this will work, once I get my other question answered about using [c# to pull the DOM snapshot vs source code.](http://stackoverflow.com/questions/13939532/webclient-pull-dom-snapshot-data-vs-source) – Sugitime Dec 18 '12 at 19:23

You could also use CsQuery to do this. Since it uses familiar CSS selector syntax & jQuery methods, it can be easier to use than HAP for more complex DOM navigation. For example:

// function to get the text from the cell AFTER the one containing 'text'

string getNextCellText(CQ dom, string text) {
    // find the target cell
    CQ target = dom.Select(".integrationteamstats td:contains(" + text + ")");

    // return the text contents of the next cell
    return target.Next().Text();
}

void Main() {
    var dom = CQ.Create(html);
    string queue = getNextCellText(dom,"Queue");
    string wait = getNextCellText(dom,"Wait:");

    // .. do stuff
}
Jamie Treworgy
  • I like this very much. Unfortunately, HTML Agility Pack is the 800 lb gorilla and a clear HTML parser with jQuery support hasn't emerged as the winner yet. – casperOne Dec 18 '12 at 21:16
  • Well, HAP certainly wins for longevity. But I didn't realize there was any competition for .NET jQuery ports! Also - the HTML parser I'm using now is a port of the validator.nu parser (the same code Firefox uses); there is no comparison. It's actually standards compliant, and HAP's is not at all. – Jamie Treworgy Dec 18 '12 at 21:22
  • See [ScrapySharp](http://nuget.org/packages/ScrapySharp), and it works with HTML Agility Pack. Also, when dealing with scraping sites, it's not about being standards compliant, it's usually about handling malformed HTML gracefully, which HTML Agility Pack does. I'm not saying CsQuery is bad (I want to look into it), but I'm looking for the clear leader in selector tech. – casperOne Dec 18 '12 at 21:33
  • That appears to be just a CSS selector engine - like Fizzler - not a jQuery port. I don't know very much about it, but as far as CSS selectors go, the CsQuery implementation is comprehensive and is covered by the entire jQuery and Sizzle test suite (ported to C#), so I feel pretty good about it. The big problem I had in the past with HAP's parser has to do not with malformed HTML, but with valid HTML that omits optional tags. This can result in a different DOM than a browser's because the closing tag is interpolated at a different position than the spec describes. – Jamie Treworgy Dec 18 '12 at 21:40
  • .. anyway - I encourage you to check out the project. A great deal of effort has gone into making it fast, rock-solid and comprehensive. The other CSS selector engines (Fizzler and ScrapySharp) are certainly useful but far less built out; ScrapySharp doesn't actually seem to implement any of the CSS filters, and Fizzler has only partial coverage. – Jamie Treworgy Dec 18 '12 at 21:44