Imagine an HTML document similar to this:
<div>
<div>...</div>
<table>...</table>
<p>...</p>
<p>...</p>
<p>...</p>
<table>...</table>
<p>...</p>
<div>...</div>
<p>...</p>
<p>...</p>
</div>
I would like to take the first sequence of paragraph nodes. I have tried iterating over the node collection of p elements, checking nextSibling until I find a node whose name is not p, but the next sibling is always a text node.
More specifically, I want to get the first part of the text from a Wikipedia page: all the paragraphs that come before the first non-paragraph element, such as a table of contents, or before the end of the page on pages that have no such element. In the example above, I would like to get an HtmlDocument containing the first three paragraphs.
I could do this by converting the document to a string and using IndexOf. However, I would prefer a more generic solution, because I don't know what I am going to find in Wikipedia pages.
The text between the tags is probably whitespace. Have you looked at it in the debugger? I'd `foreach` through the first div's `Elements` property.
– jessehouwing Jan 17 '13 at 22:36
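A minimal sketch of what the comment suggests, assuming the library is HtmlAgilityPack (the question's HtmlDocument and node collection point that way, but it is not stated). It walks the children of the outer div, ignores every node that is not an element (which is where the whitespace text nodes show up), and collects the first unbroken run of p elements. The class name `FirstParagraphs`, the method name `Take`, and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class FirstParagraphs
{
    // Returns the first consecutive run of <p> elements among the direct
    // children of the given container node.
    public static List<HtmlNode> Take(HtmlNode container)
    {
        var run = new List<HtmlNode>();
        foreach (HtmlNode child in container.ChildNodes)
        {
            // Text nodes (typically the whitespace between tags) and comments
            // are skipped. Simplification: non-whitespace text between two
            // paragraphs would not end the run either.
            if (child.NodeType != HtmlNodeType.Element)
                continue;

            if (child.Name == "p")
                run.Add(child);      // still inside (or just entering) the run
            else if (run.Count > 0)
                break;               // first non-<p> element after the run ends it
            // Non-<p> elements before the first paragraph are simply passed over.
        }
        return run;
    }

    static void Main()
    {
        // Hypothetical markup shaped like the example in the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div>\n<div>a</div>\n<table></table>\n" +
            "<p>1</p>\n<p>2</p>\n<p>3</p>\n" +
            "<table></table>\n<p>4</p>\n</div>");

        // "//div" returns the first div in document order, i.e. the outer one.
        HtmlNode outer = doc.DocumentNode.SelectSingleNode("//div");

        foreach (HtmlNode p in Take(outer))
            Console.WriteLine(p.OuterHtml);
    }
}
```

On a real Wikipedia page the container would have to be whichever element holds the article body rather than the first div, and that selector is something you would need to check against the actual markup. If you need a new HtmlDocument rather than a list of nodes, one option is to concatenate the OuterHtml of the collected nodes and load that string into a fresh document.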