Parsing HTML with the HTML Agility Pack and LINQ

Question

I have the following HTML:

(..)
<tbody>
 <tr>
  <td class="name"> Test1 </td>
  <td class="data"> Data </td>
  <td class="data2"> Data 2 </td>
 </tr>
 <tr>
  <td class="name"> Test2 </td>
  <td class="data"> Data2 </td>
  <td class="data2"> Data 2 </td>
 </tr>
</tbody>
(..)

The information I have is the name, so "Test1" and "Test2". How can I get the values in the "data" and "data2" cells based on the name I have?

Currently I'm using:

var data =
    from
        tr in doc.DocumentNode.Descendants("tr")
    from   
        td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
    where
        td.InnerText == "Test1"
    select tr;

But I get {"Object reference not set to an instance of an object."} when I try to inspect data.

What exactly are you trying to do? And what is the code doing that you don't want? – R. Martinho Fernandes, Jan 06 '11 at 15:55
Can you tell us what your error is? Or what you're expecting to happen that doesn't happen? – James Walford, Jan 06 '11 at 16:03
I've changed my question, hopefully to make it a bit more understandable. – Timo Willemsen, Jan 06 '11 at 16:05
In your example the text in your tds has leading and trailing whitespace, whereas the string you're looking for doesn't. – James Walford, Jan 06 '11 at 16:17
You might try "where td.InnerText.Trim().Equals("Test1", StringComparison.InvariantCultureIgnoreCase)". – Jacob Proffitt, Jan 06 '11 at 16:29

score 16 · Accepted Answer · answered Jan 06 '11 at 16:59

As for your attempt, you have two issues with your code:

ChildNodes is weird - it also returns whitespace text nodes, which don't have a class attribute (they can't have attributes, of course). For those nodes x.Attributes["class"] is null, so reading .Value throws the null reference exception.
As James Walford commented, the spaces around the text are significant; you probably want to trim them.

With these two corrections, the following works:

var data =
      from tr in doc.DocumentNode.Descendants("tr")
      from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
     where td.InnerText.Trim() == "Test1"
    select tr;
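
The corrected query still reads Attributes["class"].Value, so it would throw again on any <td> that has no class attribute. A minimal sketch of a null-safe variant follows, assuming HtmlNode.GetAttributeValue is used to supply a default; the inline sample markup (including the cell without a class) is made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative markup modelled on the question; the second row's
        // first cell deliberately has no class attribute.
        const string html = @"<table><tbody>
 <tr><td class=""name""> Test1 </td><td class=""data""> Data </td><td class=""data2""> Data 2 </td></tr>
 <tr><td> no class </td><td class=""data""> Other </td></tr>
</tbody></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows =
            from tr in doc.DocumentNode.Descendants("tr")
            from td in tr.Elements("td")
            // GetAttributeValue returns the supplied default ("") when the
            // attribute is missing, so there is no null to dereference.
            where td.GetAttributeValue("class", "") == "name"
               && td.InnerText.Trim() == "Test1"
            select tr;

        // Print the "data" and "data2" cells of every matching row.
        foreach (var tr in rows)
        {
            foreach (var td in tr.Elements("td"))
            {
                string cls = td.GetAttributeValue("class", "");
                if (cls == "data" || cls == "data2")
                    Console.WriteLine(cls + ": " + td.InnerText.Trim());
            }
        }
    }
}
```

Elements("td") only looks at direct child elements named td, which also sidesteps the whitespace text nodes that ChildNodes returns.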

score 5 · Answer 2 · answered Jan 06 '11 at 17:48

Here is the XPath way - hmmm... everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)

This function gets all data values associated with a name:

public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    return from HtmlNode node in
        document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
        select node.InnerText.Trim();
}

For example, this code will dump all 'Test2' data:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(yourHtml);

    foreach (string data in GetData(doc, "Test2"))
    {
        Console.WriteLine(data);
    }
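
Two things are worth guarding against in GetData: contains() also matches names such as "Test10", and SelectNodes returns null (not an empty collection) when nothing matches. The following is a rough sketch of an exact-match variant, assuming normalize-space is acceptable for trimming; GetDataExact and its containing class are hypothetical names, not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ExactMatchSketch
{
    // Hypothetical variant of GetData that matches the name cell exactly.
    public static IEnumerable<string> GetDataExact(HtmlDocument document, string name)
    {
        // Assumption: name contains no single quote. XPath 1.0 string
        // literals have no escape sequence for quotes.
        string xpath = "//td[@class='name' and normalize-space(.)='" + name
                     + "']/following-sibling::td";

        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);

        // SelectNodes yields null when no node matches the expression.
        if (nodes == null)
            return Enumerable.Empty<string>();

        return nodes.Select(node => node.InnerText.Trim());
    }
}
```

It is called the same way as GetData, for example GetDataExact(doc, "Test2").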

Simon Mourier

I thought about one xpath with `contains`, but it does have a major problem: searching for `Test1` will also find `Test10`, `NotTest1` and so forth. I don't really know enough xpath to get over that problem... – Kobi Jan 06 '11 at 18:00
@Kobi - If you don't want to use contains, then you can use =. If whitespaces are an issue, they can be removed with normalize-space, or else this link has more info: http://stackoverflow.com/questions/1852571/xpath-function-to-remove-white-space – Simon Mourier Jan 06 '11 at 19:11
The reason I prefer the Linq answer over XPath is that the latter is hard to read and understand. With the former it is perfectly clear what is intended, and if necessary you can break the query into subqueries to debug it. XPath is obtuse and impossible to debug. It's difficult to verify it's doing the right thing without a lot of test data. Just googling for an authoritative page on XPath syntax is a hateful chore. I still love HAP, but every time I see an XPath statement I cringe. – Dan Bailiff Jan 21 '13 at 18:55
Everything is hard when you don't know it. I think XPATH is much easier to use and understand when querying an XML set. It also handles nulls gracefully (unlike Linq). The only (big) drawback is it's not cool for case insensitive comparison. Another issue is XPATH is not portable (does not exist on WinRT for example). Anyway, use the library you prefer :-) – Simon Mourier Jan 21 '13 at 19:01
Conversely, everything seems easy if you understand it. That doesn't mean it's easy for everyone else. LINQ has many applications. XPATH does not. Personally, in my case I'd rather say .Descendants("a").Last() over "//a[last()]", but since my need is this one little thing, I appreciate you jogging my memory of xpath and upvote your answer. – Christopher Painter Jun 07 '19 at 14:49

Answer 3 · Kobi · answered Jan 06 '11 at 16:51

Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
                              .SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
    node => node.Descendants("td")
        .ToDictionary(descendant => descendant.Attributes["class"].Value,
                      descendant => descendant.InnerText.Trim())
        ).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];

Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
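
One way the "abstract that away" tip might look is sketched below. BuildLookup and RowLookupSketch are hypothetical names, and the handling of missing or duplicate classes (skip, last one wins) is an assumption rather than anything the answer specifies.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RowLookupSketch
{
    // Hypothetical helper: maps the text of each row's "name" cell to a
    // dictionary of class -> trimmed cell text for that row.
    public static Dictionary<string, Dictionary<string, string>> BuildLookup(HtmlDocument doc)
    {
        var lookup = new Dictionary<string, Dictionary<string, string>>();

        foreach (HtmlNode tr in doc.DocumentNode.Descendants("tr"))
        {
            var cells = new Dictionary<string, string>();
            foreach (HtmlNode td in tr.Elements("td"))
            {
                // Cells without a class attribute are skipped.
                string cls = td.GetAttributeValue("class", "");
                if (cls.Length > 0)
                    cells[cls] = td.InnerText.Trim(); // last cell wins on a repeated class
            }

            // Rows without a "name" cell are ignored.
            string name;
            if (cells.TryGetValue("name", out name))
                lookup[name] = cells; // last row wins on a repeated name
        }

        return lookup;
    }
}
```

With this in place the lookup reads as BuildLookup(doc)["Test1"]["data"]. Unlike the chained ToDictionary calls, it does not throw on duplicate keys or on cells that lack a class attribute.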

score -1 · Answer 4 · answered Aug 27 '13 at 20:08

I can recommend one of two ways:

http://htmlagilitypack.codeplex.com/, which converts the HTML to valid XML that can then be queried with out-of-the-box LINQ.

Or,

Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which, while not maintained on CodePlex (that was a hint, Keith), gives a reasonable working set of features to springboard from.

score -1 · Answer 5 · answered Jan 06 '11 at 17:04

instead of

td.InnerText == "Test1"

try

td.InnerText == " Test1 "

or

td.InnerText.Trim() == "Test1"

Kurru
