I am trying to get the root node of a PHP DOMDocument. This is usually done like this:

$doc->documentElement;

However, trying this on an HTML string that contains a doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml">...

and that is loaded into a DOM Document object like so:

$doc = new DOMDocument();
$doc->loadHTML($html);

returns the html tag as the root node and not the doctype tag! I am guessing this is because of the unusual `<!` characters. Is there any way to return the root node correctly?

Abs
  • [It's called "element type name", dammit.](http://www.flightlab.com/~joe/sgml/faq-not.txt) :-) The `DOCTYPE` declaration is an SGML construct that is not part of the document grammar itself. The root *element* is indeed the `html` element. In general, SGML declarations are things that look like `<!**** >`, where `****` is some keyword, and those are *not* part of the document tree. The only declarations that can appear at the top level are the doctype declaration, comment declarations `<!-- ... -->`, notation declarations (and perhaps marked sections). – Kerrek SB Nov 28 '11 at 14:36

3 Answers

The doctype isn't the root node; html is. The doctype is simply the declaration that tells the browser what kind of document the rest of the file is.

Maybe you can use the `doctype` property of DOMDocument (`$doc->doctype`)?
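
As a minimal sketch of the distinction (the sample markup and variable names below are made up for illustration), `documentElement` gives the `html` element, while the `doctype` property exposes the declaration as a separate `DOMDocumentType` object:

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Sample markup for illustration only; any HTML string with a doctype works.
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
      . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
      . '<html xmlns="http://www.w3.org/1999/xhtml">'
      . '<head><title>Example</title></head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// The root element is <html>, not the doctype.
$rootName = $doc->documentElement->nodeName;

// The doctype declaration is available separately.
$doctype     = $doc->doctype;
$doctypeName = $doctype->name;      // the name that follows "<!DOCTYPE"
$publicId    = $doctype->publicId;  // the PUBLIC identifier
$systemId    = $doctype->systemId;  // the DTD URL

echo $rootName, "\n", $doctypeName, "\n", $publicId, "\n", $systemId, "\n";
```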

Tom van der Woerdt
  • I tried to use the doctype object to reconstruct the doctype, but it doesn't give me back everything, just some components. But I understand now: the doctype isn't the root node. I'll just stick with my regex to get the doctype back. – Abs Nov 28 '11 at 14:52
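
For what it's worth, the components exposed by the doctype object (`name`, `publicId`, `systemId`) are usually enough to rebuild the declaration without a regex. The following is only a sketch: `doctypeToString` is a hypothetical helper, not part of the DOM extension, and it ignores any internal subset.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

/**
 * Hypothetical helper: rebuild a "<!DOCTYPE ...>" string from the
 * name, public identifier and system identifier of a doctype node.
 */
function doctypeToString(DOMDocumentType $dt): string
{
    $out      = '<!DOCTYPE ' . $dt->name;
    $publicId = (string) $dt->publicId;
    $systemId = (string) $dt->systemId;

    if ($publicId !== '') {
        $out .= ' PUBLIC "' . $publicId . '"';
        if ($systemId !== '') {
            $out .= ' "' . $systemId . '"';
        }
    } elseif ($systemId !== '') {
        $out .= ' SYSTEM "' . $systemId . '"';
    }

    return $out . '>';
}

// Sample markup for illustration only.
$html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
      . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
      . '<html><head><title>Example</title></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$rebuilt = doctypeToString($doc->doctype);
echo $rebuilt, "\n";
```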

I ran into this problem some time ago, in my case because I didn't want the DOCTYPE in there at all. I was working with code snippets and had a hard time keeping the returned values free of the DOCTYPE and HTML tags that were being added when they shouldn't have been.

I am going to present an answer that isn't here yet, just in case you're having the same problem I had. My solution prevents a default DOCTYPE from being added when the input has none, provided your versions are recent enough: I believe it needs at least PHP 5.4 and libxml 2.7.8. If you meet both requirements, it's as simple as passing a constant flag to the DOMDocument object's loadHTML method. The constant is LIBXML_HTML_NODEFDTD, and it is used like this:

$doc = new DOMDocument();
$doc->loadHTML($someContentString, LIBXML_HTML_NODEFDTD);

This way there is no additional parsing needed at all and you can go about your life without this DOCTYPE problem... unless you needed the DOCTYPE tag, in which case ignore my answer and let someone else find it through Google :)
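
To see the effect of the flag, here is a small sketch that loads the same snippet with and without it and checks whether a doctype node ends up in the document. The snippet and variable names are invented for the example, and it assumes a PHP and libxml build that supports the flag.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// A fragment with no doctype of its own (illustrative only).
$fragment = '<p>Just a snippet</p>';

// Default behaviour: the parser adds a default doctype.
$withDefault = new DOMDocument();
$withDefault->loadHTML($fragment);

// With the flag: no default doctype is added.
$withoutDefault = new DOMDocument();
$withoutDefault->loadHTML($fragment, LIBXML_HTML_NODEFDTD);

$hasDoctypeByDefault = $withDefault->doctype !== null;
$hasDoctypeWithFlag  = $withoutDefault->doctype !== null;

var_dump($hasDoctypeByDefault, $hasDoctypeWithFlag);
```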

GoreDefex

The DOCTYPE is not actually a node, and it certainly isn't the root node. Try $doc->doctype.

Explosion Pills
  • The DOCTYPE is a node (it inherits from DOMNode): `$doc->firstChild->...->nodeType === XML_DOCUMENT_TYPE_NODE`. [w3: Node::DOCUMENT_TYPE_NODE](http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-1950641247) – Saxoier Nov 28 '11 at 14:54
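
The point in the comment above can be checked with a short sketch (the markup is a made-up example): the doctype shows up among the document's child nodes with node type `XML_DOCUMENT_TYPE_NODE`, alongside the `html` element, which remains the root element.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Sample markup for illustration only.
$doc->loadHTML('<!DOCTYPE html><html><head><title>Example</title></head>'
             . '<body><p>Hi</p></body></html>');

// List the direct children of the document with their node types.
foreach ($doc->childNodes as $child) {
    echo get_class($child), ' ', $child->nodeName, ' ', $child->nodeType, "\n";
}

// The doctype is a DOMNode and is the document's first child ...
$doctypeIsNode       = $doc->doctype instanceof DOMNode;
$doctypeIsFirstChild = $doc->firstChild->nodeType === XML_DOCUMENT_TYPE_NODE;

// ... while documentElement is still the <html> element.
$rootIsElement = $doc->documentElement->nodeType === XML_ELEMENT_NODE;
```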