Speeding up CsQuery selectors by using html substring

Question

I want to parse some complex, heavy HTML pages. I recently read about CsQuery and looked at a performance comparison of CsQuery vs. Html Agility Pack and Fizzler. According to these tests, CsQuery turns out to be slower when creating the DOM because of its index creation.

Let's say I want to select a certain element (without an id) in a heavy HTML page, and I know the ID of one of its ancestors, which I will use as a context element. If I load the whole heavy HTML into the DOM, loading will be slow, so my selection will be slow overall. However, if I can SOMEHOW quickly pre-process the HTML, extract the substring containing the context element (whose ID I know), and load only that into the DOM, it should be faster. In that case I would have gotten rid of lots of unneeded HTML for which no index needs to be built.

I am using CsQuery because I want something jQuery-like.

My question is:

Given an HTML document string, is there a FAST way (e.g. linear-time) to get the HTML substring of an element given its id?

Were you able to get my suggestion working? – Benjamin Gruenbaum, Oct 21 '13 at 20:03

Benjamin Gruenbaum · Answer 1 · 2013-03-16T03:12:50.050

First of all, let me say that I think you've made the correct choice with CsQuery; I switched from HAP to it a while ago and I couldn't be happier with the switch. The newest pre-release of CsQuery lets you turn off indexing completely, or do only partial indexing of your document.

From the issue tracker:

In the current prerelease code there's an alternate indexing strategy you can use which speeds up DOM construction quite a bit, at the expense of complex queries. (Actually there's two new strategies, you can turn the index off altogether if you really want to :) This may be better for the kind of scenarios you're dealing with.

If you're willing to download the code from its GitHub repository and compile it, you'll be able to do just that with the pre-release.

The DomIndexProviders class contains three options: RangedDomIndexProvider, which indexes a lot of selectors and is very clever; SimpleDomIndexProvider, which allows basic indexing; and NoDomIndexProvider, which does not do indexing at all. SimpleDomIndexProvider is very straightforward and might work in your case; you might also consider no indexing.
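As a rough sketch of what switching strategies might look like: this assumes the pre-release exposes the choice as a static `Config.DomIndexProvider` setting and the three providers as `DomIndexProviders.Ranged`, `.Simple` and `.None`, so check the source you compile for the exact names and namespace. The ids and class names in the HTML are made up for illustration.

```csharp
using System;
using CsQuery;
using CsQuery.Engine; // assumed namespace of DomIndexProviders; verify in the source

class IndexProviderSketch
{
    static void Main()
    {
        // Assumption: the indexing strategy is a global setting that has to be
        // chosen before the document is created.
        // Swap in DomIndexProviders.None to skip indexing altogether.
        Config.DomIndexProvider = DomIndexProviders.Simple;

        // Hypothetical page: "content" stands in for the ancestor id you know.
        string html =
            "<div id='sidebar'><span class='price'>1</span></div>" +
            "<div id='content'><p><span class='price'>42</span></p></div>";

        CQ dom = CQ.Create(html);

        // Use the known ancestor as the context, then select inside it.
        CQ price = dom["#content"].Find("span.price");
        Console.WriteLine(price.Text());
    }
}
```

The trade-off is the one described in the issue tracker quote: with a simpler index (or none), building the DOM gets cheaper while complex selectors get slower, so it is worth timing both against your real pages.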

The new version with this feature is also available on NuGet now if you "include prereleases": http://www.nuget.org/packages/CsQuery/1.3.5-beta3 – Jamie Treworgy, Mar 16 '13 at 06:32
