I just wrote up this test to see if I was crazy...

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;

    namespace HtmlAgilityPackFormBug
    {
        class Program
        {
            static void Main(string[] args)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(@"
    <!DOCTYPE html>
    <html>
        <head>
            <title>Form Test</title>
        </head>
        <body>
            <form>
                <input type=""text"" />
                <input type=""reset"" />
                <input type=""submit"" />
            </form>
        </body>
    </html>
    ");
                var body = doc.DocumentNode.SelectSingleNode("//body");
                foreach (var node in body.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
                    Console.WriteLine(node.XPath);
                Console.ReadLine();
            }
        }
    }

And it outputs:

    /html[1]/body[1]/form[1]
    /html[1]/body[1]/input[1]
    /html[1]/body[1]/input[2]
    /html[1]/body[1]/input[3]

But, if I change <form> to <xxx> it gives me:

    /html[1]/body[1]/xxx[1]

(As it should). So... it looks like those input elements are not contained within the form, but directly within the body, as if the <form> just closed itself off immediately. What's up with that? Is this a bug?


Digging through the source, I see:

    ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);

It has the "empty" flag, like META and IMG. Why?? Forms are most definitely not supposed to be empty.

mpen
  • Out of curiosity, does it still behave like that if you give the form an action and method? – Marc Gravell Nov 18 '10 at 19:54
  • @Marc: That thought occurred to me too, and yes, it does still behave that way. – mpen Nov 18 '10 at 19:56
  • @Mark - it *sounds* like it might be a bug then... it *certainly* seems contrary to expectation. – Marc Gravell Nov 18 '10 at 19:59
  • @Marc: Well that sucks. I'm basing my entire project on this, and now I find out I can't trust it to do what's expected of it. Might have to switch to SgmlReader, but I don't know if that'll be any better. – mpen Nov 18 '10 at 20:04
  • I fully agree. This is an intriguing find (I must come back and upvote this tomorrow - I have run out of votes for today) – Marc Gravell Nov 18 '10 at 20:05
  • Since I'm the original HAP author, I can explain why it's marked as empty, see my full answer below, as comments are limited in size :) – Simon Mourier Nov 21 '10 at 09:41

2 Answers

This is also reported in this workitem. It contains a suggested workaround from DarthObiwan.

You can change this without recompiling. The ElementsFlags list is a static property on the HtmlNode class. The "form" entry can be removed with

    HtmlNode.ElementsFlags.Remove("form");

before loading the document.
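
A minimal sketch of that workaround applied to a cut-down version of the question's test document, assuming the HtmlAgilityPack package is referenced. The class name `FormFlagWorkaround` is made up for illustration. Because `HtmlNode.ElementsFlags` is static, the removal affects every document loaded afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

class FormFlagWorkaround // hypothetical name, for illustration only
{
    static void Main()
    {
        // Drop the "form" entry before any document is loaded.
        // ElementsFlags is static, so this applies process-wide.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><body><form>
            <input type=""text"" />
            <input type=""submit"" />
        </form></body></html>");

        // With the flag removed, the inputs should be found under the form.
        // SelectNodes returns null when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//form//input");
        if (inputs == null)
        {
            Console.WriteLine("No inputs found inside the form.");
            return;
        }
        foreach (var node in inputs)
            Console.WriteLine(node.XPath);
    }
}
```

If the removal worked, each printed XPath should contain a `form[1]` step instead of hanging directly off `body[1]`.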

MatthewMartin
Hans Passant
  • Thanks Hans :) I just discovered C# supports static constructors... that'll be a good place to put this fix. – mpen Nov 18 '10 at 20:19

Since I'm the original HAP author, I can explain why it's marked as empty :)

This is because when HAP was designed, back in 2000, HTML 3.2 was the standard. You're probably aware that tags can overlap in HTML. That is, <b>bold<i>italic and bold</b>italic</i> is supported by all browsers (although it's not officially in the HTML specification). The FORM tag can overlap other tags in the same way.

HAP was designed to handle any HTML content. Rather than break on most of the pages you could find at that time, we decided to treat tags that can overlap as EMPTY (using the ElementsFlags property), so:

  • you can still load them
  • you can save them back without breaking the original HTML (if you don't need what's inside the form in any programmatic way).

The only thing you cannot do is work with their contents programmatically: not through the API's tree model, and not with XSL. Today, with XHTML/XML almost everywhere, this sounds strange, but that's why I created the ElementsFlags :)
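
Since the flags are configurable from code, the `form` entry can also be inspected and adjusted rather than removed outright. The sketch below is only an illustration. It assumes a HAP version where `HtmlNode.ElementsFlags` is a `Dictionary<string, HtmlElementFlag>` and where `HtmlElementFlag` is a flags enum (as the `CanOverlap | Empty` expression in the question suggests). The class name `FormFlagInspection` is made up. It clears only the Empty bit and keeps CanOverlap; whether that yields the tree you want is something to check against your own documents.

```csharp
using System;
using HtmlAgilityPack;

class FormFlagInspection // hypothetical name, for illustration only
{
    static void Main()
    {
        HtmlElementFlag flags;
        if (HtmlNode.ElementsFlags.TryGetValue("form", out flags))
        {
            // Show what the library currently registers for <form>.
            Console.WriteLine("form flags before: " + flags);

            // Clear only the Empty bit; any other bits (e.g. CanOverlap) stay set.
            HtmlNode.ElementsFlags["form"] = flags & ~HtmlElementFlag.Empty;
            Console.WriteLine("form flags after: " + HtmlNode.ElementsFlags["form"]);
        }
        else
        {
            // Some versions may not register "form" at all.
            Console.WriteLine("form has no entry in ElementsFlags.");
        }
    }
}
```

As with removing the entry, this has to run before the document is loaded, and it changes behavior for the whole process.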

Simon Mourier
  • Yes.... it does sound strange. I guess the question then is whether or not you have any plans to update HAP to work with current practices? (Thanks for the explanation) – mpen Nov 21 '10 at 23:55
  • I don't work on HAP any more (I have another similar library which performs better - it's internal). The last version I released was 1.3. HAP is now on CodePlex, maintained by someone else who can update it. This "overlap/empty tag" question has been raised many times :) You should raise this concern in the discussions / wishes. – Simon Mourier Nov 23 '10 at 10:02
  • But in the OP's example, the elements are not overlapping. The input elements are closed. I appreciate the work you've done on HAP. It's a huge help to many people. But hopefully the other author will fix it or at least someone with the motivation will fork it. – Josh Jun 15 '12 at 13:43
  • This would not be a 'fix', as it's by design, configurable by code, and open source. It could/would be a breaking change. – Simon Mourier Jun 15 '12 at 13:53