18

I don't like some of the design decisions made in HtmlAgilityPack:

  • When using SelectNodes, if no nodes are found, it returns null rather than an empty set, so you can't just foreach over it without a null check.
  • When trying to select children with node.SelectNodes, it actually searches from the document root unless you use descendant::, which is neither obvious nor expected behavior, IMO.
  • HtmlDocument.Load doesn't return the root node, which is what you'd want 99% of the time, I think.

You might disagree with that, of course, but that's not the point. I'm looking for something different: something that behaves a little more predictably, or, even better, something that uses jQuery syntax. Suggestions?

mpen
  • I hope this works for you: http://code.google.com/p/fizzler/ –  Jan 06 '12 at 14:21
  • For the examples you've given, it should be fairly easy to alter the behaviour to that which you desire. Since HtmlAgilityPack is open source, have you considered taking a local fork and making those changes? – Adam Ralph Sep 10 '10 at 06:04
  • For the time being I've just wrapped it with my own functions, but still. If there's something else out there a little more aligned with my philosophies, I'm not going to waste my efforts :) I only dabble in HTML parsing once in a while for small projects, so I don't think it's worth my time to overhaul it to be the way I think it ought to be. – mpen Sep 10 '10 at 07:16
  • [CsQuery](https://github.com/jamietre/CsQuery) is a jQuery port for .NET 4 – hjdm Jan 27 '13 at 07:47
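
The kind of wrapper mentioned in the comments above might look roughly like the sketch below. It is only an illustration, not mpen's actual code: `HtmlNodeHelpers`, `SelectNodesOrEmpty`, and `ParseRoot` are hypothetical names that do not exist in HtmlAgilityPack, and the sample HTML is made up. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helpers; none of these names are part of HtmlAgilityPack.
public static class HtmlNodeHelpers
{
    // Returns an empty sequence instead of null when nothing matches,
    // so callers can foreach without a null check.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(this HtmlNode node, string xpath)
    {
        return node.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }

    // Parses an HTML string and hands back the root node directly.
    public static HtmlNode ParseRoot(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode;
    }
}

public static class Demo
{
    public static void Main()
    {
        HtmlNode root = HtmlNodeHelpers.ParseRoot(
            "<div id='a'><a href='/x'>x</a></div><div id='b'></div>");

        foreach (HtmlNode div in root.SelectNodesOrEmpty("//div"))
        {
            // ".//a" is relative to the current div; "//a" would start
            // again from the document root.
            foreach (HtmlNode link in div.SelectNodesOrEmpty(".//a"))
            {
                Console.WriteLine(div.Id + ": " + link.GetAttributeValue("href", ""));
            }
        }
    }
}
```

The leading `.` in `.//a` is what keeps the inner query scoped to the context node; it is plain XPath 1.0 rather than anything specific to this wrapper.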

2 Answers

4

I started a project called SharpQuery.

Currently supports ID, class, tag, and attribute selectors.

a
a[href]
a[href^=http://stackoverflow.com]
.class
#id
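
For comparison, here is a sketch of roughly equivalent XPath 1.0 expressions for the selectors above, run through HtmlAgilityPack's `SelectNodes`. This is not SharpQuery's API or implementation; the sample markup and the `SelectorComparison` class are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SelectorComparison
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id='id'><a href='http://stackoverflow.com/q'>so</a>" +
            "<a class='class other'>no href</a></div>");

        // CSS-style selector -> roughly equivalent XPath 1.0 expression.
        var equivalents = new Dictionary<string, string>
        {
            { "a", "//a" },
            { "a[href]", "//a[@href]" },
            { "a[href^=http://stackoverflow.com]",
              "//a[starts-with(@href, 'http://stackoverflow.com')]" },
            { ".class",
              "//*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]" },
            { "#id", "//*[@id='id']" },
        };

        foreach (var pair in equivalents)
        {
            // SelectNodes returns null when nothing matches, hence the check.
            HtmlNodeCollection found = doc.DocumentNode.SelectNodes(pair.Value);
            int count = found == null ? 0 : found.Count;
            Console.WriteLine(pair.Key + " => " + pair.Value + " (" + count + " match(es))");
        }
    }
}
```

The class selector is the awkward one: XPath 1.0 has no notion of space-separated tokens, so the `concat`/`normalize-space` idiom is the usual workaround.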

Update: I'm not maintaining this project, sorry. CsQuery has recent updates (as of July 2013), but I don't have any experience using it.

mpen
  • I voted for SharpQuery + HtmlAgilityPack to merge a long time ago, because the HTML parser and DOM structure should be distinctly separate from the query engine... Also, HtmlAgilityPack supports multiple query methods: XPath, LINQ and DOM traversal. SharpQuery on top of that would be awesome – CVertex Sep 27 '10 at 02:50
  • Oh, wait, you just started that project? I proposed joining a different project that does the exact same thing as SharpQuery... let me find it – CVertex Sep 27 '10 at 02:51
  • Found it: http://code.google.com/p/fizzler/ and here's my merge request: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=76383 – CVertex Sep 27 '10 at 02:51
  • @CVertex: Could have mentioned Fizzler before I started this project :p I just added a neat regex selector `a[href %= /caseInsensitive/i]` – mpen Sep 27 '10 at 03:39
  • Anyway, SharpQuery is just a set of extension methods that work on `IEnumerable`, so it should play pretty nicely with HtmlAgilityPack. Might make it based on `IXPathNavigable` instead if I can figure out how... then it should work with any XML document. – mpen Sep 27 '10 at 03:53
2

If you're just parsing the HTML, another option might be SgmlReader. If you're modifying the HTML, not so much. I don't recall how it behaves with respect to the issues you raised, but it's worth checking out.

aciemian
  • As far as I can see, that library only converts malformed HTML into valid HTML... it says nothing about XPath/querying/traversing the DOM tree. I don't need to modify the document, but I *do* need to query it. – mpen Sep 14 '10 at 01:22
  • It turns it into valid XML in the form of an XmlDocument. Then you can call one of the XmlDocument.CreateNavigator() overloads to get an XPathNavigator object to perform XPath queries. – aciemian Sep 15 '10 at 00:49
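
A minimal sketch of the querying step described in the last comment, using only `System.Xml`. The SgmlReader conversion itself is left out: the `XmlDocument` here is loaded from a made-up, already well-formed string as a stand-in for whatever SgmlReader would produce.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class XPathDemo
{
    public static void Main()
    {
        // Stand-in for the XmlDocument that SgmlReader would produce from real HTML.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><body><a href='http://example.com/'>one</a><a>two</a></body></html>");

        // CreateNavigator() gives an XPathNavigator positioned at the document root.
        XPathNavigator nav = doc.CreateNavigator();

        // Select every <a> element that carries an href attribute.
        XPathNodeIterator links = nav.Select("//a[@href]");

        while (links.MoveNext())
        {
            Console.WriteLine(links.Current.GetAttribute("href", ""));
        }
    }
}
```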