I'm trying to match a table with a regex, but I can't figure out why it won't match properly. Here is the HTML:

    <table class="integrationteamstats">
    <tbody>
    <tr>
        <td class="right">
            <span class="mediumtextBlack">Queue:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0</span>
        </td>
        <td class="right">
            <span class="mediumtextBlack">Aban:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0%</span>
        </td>
        <td class="right">
            <span class="mediumtextBlack">Staffed:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0</span>
        </td>
    </tr>
    <tr>
        <td class="right">
            <span class="mediumtextBlack">Wait:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0:00</span>
        </td>
        <td class="right">
            <span class="mediumtextBlack">Total:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0</span>
        </td>
        <td class="right">
            <span class="mediumtextBlack">On ACD:</span>
        </td>
        <td class="left">
            <span class="mediumtextBlack">0</span>
        </td>
    </tr>
    </tbody>
    </table>

I need to get two pieces of information: the data inside the td that follows Queue and the data inside the td that follows Wait (so the queue count and the wait time). Obviously the numbers are going to update frequently.

This is the regex I have for pulling the initial table, but it isn't working:

    Match statstable = Regex.Match(this.html, "<table class=\"integrationteamstats\">(.*?)</table>");

And I'm not sure what regex I should use to get the data from the td's.

Before anyone asks: no, there is no way I can update the HTML to have an ID or anything of that nature. It's pretty much as is. The only thing that is consistent is the location of the td's.

Sugitime

1 Answer

Instead of regex, I suggest using the HTML Agility Pack to parse the HTML and query its structure.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
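
As a rough sketch of what that could look like for the table in the question: the snippet below assumes the HtmlAgilityPack NuGet package is referenced. The `TeamStats` class, the `ValueFor` helper and the trimmed-down sample markup are made up for illustration; in the real code the markup would come from `this.html`. It looks up the cell whose span holds a label and returns the text of the cell right after it.

```csharp
using System;
using HtmlAgilityPack; // external dependency: the HtmlAgilityPack NuGet package

public static class TeamStats
{
    // Shortened stand-in for the page in the question (hypothetical sample data).
    public const string SampleHtml = @"
<table class=""integrationteamstats"">
<tbody>
<tr>
    <td class=""right""><span class=""mediumtextBlack"">Queue:</span></td>
    <td class=""left""><span class=""mediumtextBlack"">0</span></td>
</tr>
<tr>
    <td class=""right""><span class=""mediumtextBlack"">Wait:</span></td>
    <td class=""left""><span class=""mediumtextBlack"">0:00</span></td>
</tr>
</tbody>
</table>";

    // Returns the text of the cell that follows the cell labelled `label`,
    // or null if no such cell is found. The label must not contain a single
    // quote, because it is pasted into the XPath string as-is.
    public static string ValueFor(HtmlDocument doc, string label)
    {
        string xpath =
            "//table[@class='integrationteamstats']" +
            "//td[span[normalize-space()='" + label + "']]" +
            "/following-sibling::td[1]/span";

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        Console.WriteLine("Queue: " + ValueFor(doc, "Queue:"));
        Console.WriteLine("Wait:  " + ValueFor(doc, "Wait:"));
    }
}
```

Keying on the label text rather than on a fixed cell index is a design choice here; since the question says only the position of the td's is reliable, a positional XPath such as `(//table[@class='integrationteamstats']//td)[2]` may suit the page just as well.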

In general, regex is a poor choice for parsing HTML.
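
That said, the pattern in the question most likely fails for a simple reason: in .NET, `.` does not match a newline unless `RegexOptions.Singleline` is passed, and the table spans many lines. The sketch below shows the difference using only `System.Text.RegularExpressions`. The `RegexTableDemo` class, its helper names and the shortened sample markup are hypothetical, and the label-based pattern in `ValueAfterLabel` is just one possible way to reach a cell; it will break as soon as the markup changes shape.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTableDemo
{
    // Shortened stand-in for the page in the question (hypothetical sample data).
    public const string SampleHtml =
        "<table class=\"integrationteamstats\">\n" +
        "<tbody>\n" +
        "<tr>\n" +
        "<td class=\"right\"><span class=\"mediumtextBlack\">Queue:</span></td>\n" +
        "<td class=\"left\"><span class=\"mediumtextBlack\">0</span></td>\n" +
        "</tr>\n" +
        "</tbody>\n" +
        "</table>";

    // The pattern from the question, unchanged.
    public const string TablePattern =
        "<table class=\"integrationteamstats\">(.*?)</table>";

    // Default options: '.' stops at '\n', so a multi-line table cannot match.
    public static bool MatchesDefault(string html)
    {
        return Regex.IsMatch(html, TablePattern);
    }

    // Singleline: '.' matches every character, including '\n'.
    public static bool MatchesSingleline(string html)
    {
        return Regex.IsMatch(html, TablePattern, RegexOptions.Singleline);
    }

    // Returns the span text of the cell after the one labelled `label`, or null.
    // \s* already covers the line breaks between tags, so no option is needed.
    public static string ValueAfterLabel(string html, string label)
    {
        string pattern = Regex.Escape(label) +
            @"</span>\s*</td>\s*<td[^>]*>\s*<span[^>]*>(.*?)</span>";
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        Console.WriteLine(MatchesDefault(SampleHtml));            // False
        Console.WriteLine(MatchesSingleline(SampleHtml));         // True
        Console.WriteLine(ValueAfterLabel(SampleHtml, "Queue:")); // 0
    }
}
```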

Oded
  • HTML Agility Pack definitely seems to be a robust and great system for this, except for the blatant lack of documentation. It's really tough to learn to use. – Sugitime Dec 18 '12 at 16:40
  • @Sugitime - The source download comes with a bunch of sample projects. – Oded Dec 18 '12 at 17:43
  • And it uses standard LINQ to XML or XPath notation for queries. These are very well documented outside of the project documentation. – jessehouwing Dec 18 '12 at 21:03
  • An alternative to HAP is CsQuery, a .NET jQuery port, which lets you use CSS3 selectors instead of XPath. It also uses a standards-compliant HTML5 parser, and indexes the documents, making it much faster than HAP. The documentation is probably just as bad as HAP's, but CSS selectors and jQuery methods are probably more familiar to most people than XPath these days. https://github.com/jamietre/csquery http://www.nuget.org/packages/CsQuery – Jamie Treworgy Dec 20 '12 at 07:08