
I'm trying to use the HtmlAgilityPack to pull all of the links from a page that are contained within a div declared as <div class='content'>. However, when I use the code below I simply get ALL links on the entire page. This doesn't really make sense to me, since I am calling SelectNodes from the sub-node I selected earlier (which, when viewed in the debugger, only shows the HTML from that specific div). It's as if it goes back to the very root node every time I call SelectNodes. The code I use is below:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(@"http://example.com");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='content']");
foreach(HtmlNode link in node.SelectNodes("//a[@href]"))
{
    Console.WriteLine(link.Value);
}

Is this the expected behavior? And if so, how do I get it to do what I'm expecting?

Adam Haile

1 Answer


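A path that starts with // is absolute: it searches from the document root no matter which node you call it on, so use a relative path instead. This will work for links that are direct children of the div:

node.SelectNodes("a[@href]")

To match links nested at any depth inside the div, start the path with a dot:

node.SelectNodes(".//a[@href]")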

Also, you can do it in a single selector:

doc.DocumentNode.SelectNodes("//div[@class='content']//a[@href]")

Also, note that link.Value isn't defined for HtmlNode, so your code doesn't compile. Use link.InnerText for the link text, or link.GetAttributeValue("href", "") for the URL.
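To see the difference between the three path forms side by side, here is a minimal sketch. It deliberately uses System.Xml on a small well-formed snippet instead of HtmlAgilityPack, on the assumption that both follow the XPath 1.0 rules for absolute and relative paths here. The sample markup, the class name XPathContextDemo and the helper CountFromDiv are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Hypothetical sample markup: one link outside the div, one link that is
    // a direct child of the div, and one link nested inside a paragraph.
    private const string Sample =
        "<html><body>" +
        "<a href='/outside'>outside</a>" +
        "<div class='content'>" +
        "<a href='/direct'>direct</a>" +
        "<p><a href='/nested'>nested</a></p>" +
        "</div>" +
        "</body></html>";

    // Evaluates the given XPath with the content div as the context node
    // and returns how many nodes matched.
    public static int CountFromDiv(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        XmlNode div = doc.SelectSingleNode("//div[@class='content']");
        return div.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Absolute path: searched from the document root, context is ignored.
        Console.WriteLine(CountFromDiv("//a[@href]"));   // prints 3
        // Relative path: direct children of the div only.
        Console.WriteLine(CountFromDiv("a[@href]"));     // prints 1
        // Relative path: all descendants of the div.
        Console.WriteLine(CountFromDiv(".//a[@href]"));  // prints 2
    }
}
```

One difference to keep in mind: XmlNode.SelectNodes returns an empty list when nothing matches, while HtmlAgilityPack's SelectNodes has traditionally returned null in that case, so a null check before the foreach is a sensible precaution there.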

Kobi
  • This doesn't seem right with the XPath I know, but it works. I would also confess I used the HtmlAgilityPack for the first time just now, to answer the question. I can't find any documentation... – Kobi May 20 '10 at 17:48
  • Regarding link.Value, I was rewriting this from memory... it was probably InnerHtml or something. So is the // making it always go back to root? I didn't get that impression from the XPath documentation on W3C – Adam Haile May 20 '10 at 18:08
  • That's pretty impressive from memory... Anyway, you are right - XPath that starts with `//` should respect its context, as far as I know. – Kobi May 20 '10 at 19:04
  • weird, kinda annoying it's not respecting the spec :( – Adam Haile May 20 '10 at 19:48
  • I should imagine that // goes back to the root: even though you grab a node from the tree, it still has a reference to the whole document, otherwise it would be impossible to traverse back up the tree with .. – Paul Sullivan Dec 31 '12 at 12:02
  • I'm getting this error: "Object reference not set to an instance of an object." – Shahid Karimi Jun 14 '13 at 12:01