7

I'm having trouble reading in XHTML because the XML parser doesn't recognise HTML character entities. For example, this:

<?php
$text = <<<EOF
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Entities are Causing Me Problems</title>
  </head>
  <body>
    <p>Copyright &copy; 2010 Some Bloke</p>
  </body>
</html>
EOF;

$imp = new DOMImplementation ();
$html5 = $imp->createDocumentType ('html', '', '');
$doc = $imp->createDocument ('http://www.w3.org/1999/xhtml', 'html', $html5);

$doc->loadXML ($text);

header ('Content-Type: application/xhtml+xml; charset=utf-8');
echo $doc->saveXML ();

Results in:

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Entity 'copy' not defined in Entity, line: 8 in testing.php on line 19

How can I fix this while still serving pages as XHTML5?

Justin Johnson
casr

4 Answers

12

XHTML5 does not have a DTD, so you cannot use the old-school HTML named entities in it: there is no document type definition to tell the parser what the named entities for this language are. The only exceptions are the predefined XML entities &lt;, &gt;, &amp;, &quot; and &apos; (though you generally don't want to use &apos;).

Instead use a numeric character reference (&#169;) or, better, just a plain unencoded © character (in UTF-8; remember to include the <meta> element to signify the character set to non-XML parsers).
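If the text cannot easily be edited by hand, the named entities could be rewritten as numeric character references before the string reaches loadXML(). The following is only a sketch under some assumptions: namedEntitiesToNumeric() is a hypothetical helper (it is not part of PHP or the DOM extension), it relies on the mbstring extension, and it uses PHP's HTML5 entity table via html_entity_decode().

```php
<?php
// Hypothetical helper (not part of PHP or the DOM extension): rewrites HTML
// named entities as numeric character references before XML parsing.
// Assumes the mbstring extension (mb_str_split needs PHP 7.4+).
function namedEntitiesToNumeric(string $xml): string
{
    // XML itself predefines these five, so they must be left alone.
    $predefined = ['lt', 'gt', 'amp', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($predefined): string {
            if (in_array($m[1], $predefined, true)) {
                return $m[0];
            }
            // Look the name up in PHP's HTML5 entity table.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // Unknown name: leave it for the parser to report.
            }
            // Emit one numeric reference per code point.
            $out = '';
            foreach (mb_str_split($decoded, 1, 'UTF-8') as $char) {
                $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
            }
            return $out;
        },
        $xml
    );

    return $result ?? $xml; // Fall back to the input if the regex engine fails.
}

// The document from the question, unchanged.
$text = <<<EOF
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Entities are Causing Me Problems</title>
  </head>
  <body>
    <p>Copyright &copy; 2010 Some Bloke</p>
  </body>
</html>
EOF;

$doc = new DOMDocument();
$doc->loadXML(namedEntitiesToNumeric($text)); // &copy; has become &#169;
$p = $doc->getElementsByTagName('p')->item(0);
echo $p->textContent, "\n";
```

Note that this sketch works on the raw string, so it would also rewrite entity-like text inside CDATA sections and comments. Where the source can be edited, writing &#169; or a literal © directly is simpler.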

bobince
  • After some searching around this does indeed appear to be the case. Seems odd but thank you very much for the information. – casr Feb 14 '10 at 19:01
  • HTML5 defines all the old HTML named entities as part of its spec, it's only *XHTML5* that doesn't, and that's mainly because *XML* requires these defined in a DTD which HTML5/XHTML5 doesn't have. – thomasrutter May 11 '16 at 06:30
2

Try using DOMDocument::loadHTML() instead. It doesn't choke on imperfect markup.

Xorlev
  • That leads to some funky output ( http://paste2.org/p/668291 ), not to mention I don't like the idea of parsing XML as HTML. – casr Feb 14 '10 at 17:56
0

Don't use loadXML and saveXML, and don't put the following declaration at the top of an HTML document:

<?xml

Instead, use loadHTML and saveHTML, and add a doctype such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


useless
0

Hi, try wrapping the content in a CDATA section:

$text = <<<EOF
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Entities are Causing Me Problems</title>
  </head>
  <body>
    <![CDATA[<p>Copyright &copy; 2010 Some Bloke</p>]]>
  </body>
</html>
EOF;
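
One thing to be aware of with this approach: a CDATA section makes the parser treat everything inside it as character data, so the <p> markup and the &copy; reference both become literal text rather than an element and a character. A small check, assuming the DOM extension is available, illustrates this:

```php
<?php
// Trimmed-down version of the document above, with the CDATA wrapper.
$text = <<<EOF
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <![CDATA[<p>Copyright &copy; 2010 Some Bloke</p>]]>
  </body>
</html>
EOF;

$doc = new DOMDocument();
$doc->loadXML($text);

// No <p> element exists: the parser treated the markup as character data.
$paragraphCount = $doc->getElementsByTagName('p')->length;

// The body's text still contains the literal string "&copy;".
$bodyText = trim($doc->getElementsByTagName('body')->item(0)->textContent);

echo $paragraphCount, "\n";
echo $bodyText, "\n";
```

The warning goes away, but a browser rendering the page as XHTML would show the tags and the entity name as visible text.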
streetparade