
Imagine an HTML document similar to this:

   <div>
      <div>...</div>
      <table>...</table>
      <p>...</p>
      <p>...</p>
      <p>...</p>
      <table>...</table>
      <p>...</p>
      <div>...</div>
      <p>...</p>
      <p>...</p>
    </div>

I would like to take the first sequence of paragraph nodes. I have tried iterating over the collection of p nodes and checking nextSibling until I find a node whose name is not p, but the next sibling is always a text node.

More specifically, I want to get the first block of text from a Wikipedia page: all the paragraphs that come before the first non-paragraph element (such as a table of contents) or, on pages without one, before the end of the page. In the example above, I would like to get an HtmlDocument containing the first three paragraphs.

I could do this by converting the document to a string and using IndexOf. However, I would prefer a more generic solution, because I don't know what I am going to find in Wikipedia pages.

gpupu
  • What do you mean by "find a name different to p"? Can you provide the "text" you are getting, and a more complete HTML input? – roboto1986 Jan 17 '13 at 22:07
  • The text between the `<p>` tags is probably whitespace. Have you looked at it in the debugger? I'd `foreach` through the first div's `Elements` property. – jessehouwing Jan 17 '13 at 22:36
  • Sorry, it seems it was not clear enough. I just edited the explanation. – gpupu Jan 18 '13 at 09:17

1 Answer


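You can use SkipWhile and TakeWhile in combination with the list of element children of the div. SkipWhile discards everything before the first p, and TakeWhile then keeps elements until the first one that is not a p.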

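 // Requires: using System; using System.Linq; using HtmlAgilityPack;
 // Element children of the root div; the XPath * matches elements only, so whitespace text nodes are excluded.
 var children = doc.DocumentNode.SelectNodes("/div/*");
 var paragraphs = children
      .SkipWhile(child => !string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase))  // skip up to the first <p>
      .TakeWhile(child => string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase));  // stop at the first non-<p>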
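
To see the SkipWhile/TakeWhile combination in isolation, here is a minimal sketch that runs without HtmlAgilityPack. It works on plain strings standing in for the element names of the div's children from the question. The class `FirstRunDemo` and the helper `FirstRun` are hypothetical names introduced only for this illustration; they are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstRunDemo
{
    // Hypothetical helper: returns the first unbroken run of names equal to
    // target (case-insensitive), or an empty list if target never occurs.
    public static List<string> FirstRun(IEnumerable<string> names, string target)
    {
        return names
            .SkipWhile(n => !string.Equals(n, target, StringComparison.OrdinalIgnoreCase))
            .TakeWhile(n => string.Equals(n, target, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        // Element names of the outer div's children, in document order,
        // mirroring the example HTML in the question.
        var names = new[] { "div", "table", "p", "p", "p", "table", "p", "div", "p", "p" };

        var run = FirstRun(names, "p");
        Console.WriteLine(run.Count); // prints 3
    }
}
```

Both operators are lazy and stop at the first element that fails their predicate, so the later paragraphs (after the second table) are never included.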
jessehouwing