HtmlAgilityPack, get a sequence of nodes with a label

Question

Imagine an HTML document similar to this:

   <div>
      <div>...</div>
      <table>...</table>
      <p>...</p>
      <p>...</p>
      <p>...</p>
      <table>...</table>
      <p>...</p>
      <div>...</div>
      <p>...</p>
      <p>...</p>
   </div>

I would like to take the first sequence of paragraph nodes. I have tried iterating over the collection of p nodes, checking NextSibling until I find a node whose name is not p, but the next sibling is always a text node.

More specifically, I want to get the first block of text from a Wikipedia page: all the paragraphs before the first non-paragraph element, such as a table of contents, or before the end of the page when there is no such element. In the example above, I would like to get an HtmlDocument containing the first three paragraphs.

I could do this by converting the document to a string and using IndexOf. However, I would prefer a more generic solution, because I don't know what I am going to find in Wikipedia pages.

What do you mean by "find a name different to p"? Can you provide the "text" you are getting, and a more complete HTML input? — roboto1986, Jan 17 '13 at 22:07
The text between the `<p>` tags is probably whitespace. Have you looked at it in the debugger? I'd `foreach` through the first div's `Elements` property. — jessehouwing, Jan 17 '13 at 22:36
Sorry, it seems it was not clear enough. I have just edited the explanation. — gpupu, Jan 18 '13 at 09:17

score 1 · Answer 1 · answered Jan 17 '13 at 22:49

You can use SkipWhile and TakeWhile in combination with the list of the div's child elements.

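    // Requires: using System; using System.Linq;
    // "/div/*" matches only element children, so whitespace text nodes are not included.
    var children = doc.DocumentNode.SelectNodes("/div/*");
    var paragraphs = children
        // Skip everything that comes before the first <p> ...
        .SkipWhile(child => !string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase))
        // ... then keep the consecutive run of <p> elements.
        .TakeWhile(child => string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase));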
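
For anyone who wants to try this end to end, here is a minimal sketch of the same idea as a small console program. The class name `FirstParagraphs`, the method `FirstRun`, and the sample HTML are made up for illustration. The XPath `/div/*` assumes the div is the top-level element, as in the question's sample, so a real Wikipedia page would most likely need a different expression. The null check is there because `SelectNodes` returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FirstParagraphs
{
    // Hypothetical helper: returns the first consecutive run of <p> elements
    // among the element children of the top-level <div>.
    public static List<HtmlNode> FirstRun(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "*" matches element nodes only, so text nodes between tags are skipped.
        var children = doc.DocumentNode.SelectNodes("/div/*");
        if (children == null)
        {
            // Nothing matched the XPath expression.
            return new List<HtmlNode>();
        }

        return children
            .SkipWhile(child => !IsParagraph(child)) // drop leading non-<p> elements
            .TakeWhile(child => IsParagraph(child))  // keep <p> elements until the run ends
            .ToList();
    }

    private static bool IsParagraph(HtmlNode node)
    {
        return string.Equals(node.Name, "p", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Sample input modeled on the question; the contents are invented.
        const string html =
            "<div>" +
            "<div>intro</div>" +
            "<table><tr><td>infobox</td></tr></table>" +
            "<p>First</p><p>Second</p><p>Third</p>" +
            "<table><tr><td>contents</td></tr></table>" +
            "<p>Later</p>" +
            "</div>";

        foreach (var paragraph in FirstRun(html))
        {
            Console.WriteLine(paragraph.InnerText);
        }
    }
}
```

Because `TakeWhile` stops at the first element that is not a `p`, paragraphs that appear after a later table or div are not included, which is what the question asks for.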
