This is my first time using Html Agility Pack, and I ran into a problem straight away.

As the title suggests, I want to get an entire element as a string, including its inner elements.

For example, given the HTML below, I am searching for the form element with the id aspnetForm:

<html>  
<head>  
</head>  
<body>  
  <form name="aspnetForm" id="aspnetForm">
    <div id="div1">  
        <a href="div1-a1">Link 1 inside div1</a>  
        <a href="div1-a2">Link 2 inside div1</a>  
    </div>  
    <a href="a3">Link 3 outside all divs</a>      
    <div id="div2">  
        <a href="div2-a1">Link 1 inside div2</a>  
        <a href="div2-a2">Link 2 inside div2</a>  
    </div> 
  </form> 
</body>  
</html>

I want the following output (as a string):

  <form name="aspnetForm" id="aspnetForm">
    <div id="div1">  
        <a href="div1-a1">Link 1 inside div1</a>  
        <a href="div1-a2">Link 2 inside div1</a>  
    </div>  
    <a href="a3">Link 3 outside all divs</a>      
    <div id="div2">  
        <a href="div2-a1">Link 1 inside div2</a>  
        <a href="div2-a2">Link 2 inside div2</a>  
    </div> 
  </form> 

I usually do not like to ask such spoon-feeding questions, but I have been trying and searching and couldn't find an answer.

Please help!

Thanks in advance!

Saeb Amini
samar

2 Answers

5

Seems you're looking for HtmlNode.OuterHtml:

//
// Summary:
//     Gets or Sets the object and its content in HTML.
public virtual string OuterHtml { get; }

So you just have to select your form node and get its OuterHtml property:

HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;

UPDATE

It seems there's a very old bug in how HAP (Html Agility Pack) treats form tags: the parsed form node comes back without its contents. Or maybe it's a feature!

In any case, here's a workaround:

HtmlNode.ElementsFlags.Remove("form");

So this should work:

HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;
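
For completeness, a self-contained version of the workaround might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the trimmed sample markup, the `Program` class, and the null check are additions for illustration and not part of the original snippet. Since `HtmlNode.ElementsFlags` is accessed statically, removing the entry presumably affects every document parsed afterwards in the same process, so it is done once, before loading the HTML.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Trimmed-down version of the markup from the question (illustrative only).
        string html = @"<html><body>
  <form name='aspnetForm' id='aspnetForm'>
    <div id='div1'><a href='div1-a1'>Link 1 inside div1</a></div>
    <a href='a3'>Link 3 outside all divs</a>
  </form>
</body></html>";

        // Workaround from the answer: drop the special handling of <form>.
        // This has to run before the HTML is parsed.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
        if (formNode == null)
        {
            // SelectSingleNode returns null when nothing matches the XPath.
            Console.WriteLine("No form with id 'aspnetForm' was found.");
            return;
        }

        // OuterHtml includes the <form> tag itself plus everything inside it.
        Console.WriteLine(formNode.OuterHtml);
    }
}
```

If only the contents of the form are needed, without the surrounding tag, `formNode.InnerHtml` can be read instead.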
Saeb Amini
  • OuterHtml, for his example - also will not return what he wants. – Veverke May 25 '16 at 13:43
  • @Veverke, hmm according to the specs, it should. Unless I'm missing something it'd be a bug if it doesn't. – Saeb Amini May 25 '16 at 13:46
  • @Veverke See [example on dotNetFiddle](https://dotnetfiddle.net/YCu5RJ) (XmlDocument because dotNetFiddle doesn't have the HtmlAgilityPack, otherwise it's identical) – Manfred Radlwimmer May 25 '16 at 13:50
  • I definitely agree with you, that's why I say it is a good question - because the obvious approach - does not work. Or am I the one missing something ? Did you try it out and you get the correct output with OuterHtml ? – Veverke May 25 '16 at 13:55
  • Strangely, I just tried this with the latest version of HtmlAgilityPack and it returns only the `<form>` tag with none of its contents. It seems it has a problem with the `form` element, but the HTML doesn't look malformed to me. – Saeb Amini May 25 '16 at 13:56
  • That's exactly what I get, hence why I raised a flag :). Exactly, again, I see nothing malformed here. – Veverke May 25 '16 at 14:01
  • @Veverke, that's interesting, I wonder if that's a bug. It'd be a glaring bug unless we're missing something obvious. – Saeb Amini May 25 '16 at 14:04
  • I again agree :-) This is really intriguing... – Veverke May 25 '16 at 14:06
  • @Veverke, found some relevant info explaining why it's happening if you're interested. – Saeb Amini May 25 '16 at 14:19
  • Nice! This should be the solution, except that the right property ultimately should be `InnerHtml`: the OP does not want the `<form>` tag itself, which `OuterHtml` includes. Upvoting. – Veverke May 25 '16 at 14:25
  • @Veverke thanks, but the OP explicitly mentions with an example that he does want the `form` tag in the output: _I want the following to be the output (in string) `<form name="aspnetForm" id="aspnetForm"> ...`_ – Saeb Amini May 25 '16 at 14:28
  • Argh, you are right... – Veverke May 25 '16 at 14:29
  • Wow. I finally figured out I was only having a problem getting the inner HTML from a form element and then I was able to find this answer. Thank you!!! – Kirk Liemohn Apr 19 '19 at 02:36
1

Indeed a good question. Weirdly enough, all of the following fails!

Using HtmlAgilityPack - not yet able to come up with a solution!

(Note that I also use the NuGet library ScrapySharp, to get the CSS selector extensions from ScrapySharp.Extensions.)

    using System;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;
    using ScrapySharp.Extensions; // provides the CssSelect extension method

    string html = @"<html>
        <head>
        </head>
        <body>
          <form name='aspnetForm' id='aspnetForm'>
            <div id='div1'>
                <a href='div1-a1'>Link 1 inside div1</a>
                <a href='div1-a2'>Link 2 inside div1</a>
            </div>
            <a href='a3'>Link 3 outside all divs</a>
            <div id='div2'>
                <a href='div2-a1'>Link 1 inside div2</a>
                <a href='div2-a2'>Link 2 inside div2</a>
            </div>
          </form>
        </body>
        </html>";

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var formElement = doc.DocumentNode.CssSelect("form").FirstOrDefault();
    // Null-conditional call avoids a NullReferenceException if no form is found
    var formChildren = formElement?.Descendants();

    StringBuilder sb = new StringBuilder();

    if (formChildren != null)
    {
        foreach (var child in formChildren)
        {
            sb.AppendLine(child.InnerHtml);
        }
    }

    // formElement.InnerHtml also returns empty!
    Console.WriteLine(sb.ToString());

You can, however, achieve this much more easily with AngleSharp (AngleSharp seems to be the recommended option these days, since it is still actively maintained and developed, whereas Html Agility Pack does not seem to be).

Using AngleSharp - works

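 // Reuses the html string from the HtmlAgilityPack attempt above.
 // In AngleSharp 0.10 and later, Parse was renamed to ParseDocument.
 HtmlParser parser = new HtmlParser();
 var parsedDoc = parser.Parse(html);
 Console.WriteLine(parsedDoc.QuerySelector("form").InnerHtml);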

Output (using AngleSharp): [screenshot not preserved]
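
Since the question asks for the form tag itself to be part of the result, `OuterHtml` is probably a closer fit than `InnerHtml`. A minimal sketch, assuming AngleSharp 0.10 or later (where the parser lives in the `AngleSharp.Html.Parser` namespace and the method is called `ParseDocument`); the trimmed sample markup, the `Program` class, and the null check are illustrative additions:

```csharp
using System;
using AngleSharp.Html.Parser;

class Program
{
    static void Main()
    {
        // Trimmed-down version of the markup from the question (illustrative only).
        string html = @"<html><body>
  <form name='aspnetForm' id='aspnetForm'>
    <div id='div1'><a href='div1-a1'>Link 1 inside div1</a></div>
    <a href='a3'>Link 3 outside all divs</a>
  </form>
</body></html>";

        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // QuerySelector returns null when no element matches the selector.
        var form = document.QuerySelector("form#aspnetForm");
        if (form == null)
        {
            Console.WriteLine("No form with id 'aspnetForm' was found.");
            return;
        }

        // OuterHtml serializes the element itself together with its children.
        Console.WriteLine(form.OuterHtml);
    }
}
```

AngleSharp serializes the parsed DOM rather than echoing the source text, so details such as attribute quoting may differ from the input.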

Veverke
  • `OuterHtml`, not `InnerHtml` – Manfred Radlwimmer May 25 '16 at 13:44
  • Check it out, outer does not return what he wants either. – Veverke May 25 '16 at 13:45
  • This question is raising some interesting outcomes... one being that `ScrapySharp`'s `CssSelect` will not accept the CSS selector `* > form` to get any node whose parent is `form`, while `AngleSharp`'s `QuerySelector` will accept it and returns the correct inner HTML here as well. (ScrapySharp does indeed have problems with CSS selectors; it's not that reliable...) – Veverke May 25 '16 at 14:13
  • Regarding the document being malformed or not, [W3C's validator](http://validator.w3.org/check) points out 3 issues, but fixing them does not make any difference for Agility Pack. – Veverke May 25 '16 at 14:22
  • Upvoting for actually trying the code and finding out it doesn't work in the first place :) – Saeb Amini May 25 '16 at 14:34