
I am having trouble loading an XML document into the DOM while preserving empty tags and zero-length strings. Here is an example:

$doc = new DOMDocument("1.0", "utf-8");

$root = $doc->createElement("root");
$doc->appendChild($root);

$element = $doc->createElement("element");
$root->appendChild($element);

echo $doc->saveXML();

produces the following XML:

<?xml version="1.0" encoding="utf-8"?>
<root><element/></root>

An empty element, exactly as expected. Now let's add an empty text node to the element.

$doc = new DOMDocument("1.0", "utf-8");

$root = $doc->createElement("root");
$doc->appendChild($root);

$element = $doc->createElement("element");
$element->appendChild($doc->createTextNode(""));
$root->appendChild($element);

echo $doc->saveXML();

produces the following XML:

<?xml version="1.0" encoding="utf-8"?>
<root><element></element></root>

A non-empty element containing a zero-length string. Good! But when I try to do:

$doc = new DOMDocument();
$doc->loadXML($xml);

echo $doc->saveXML($doc);

on these XML documents I always get

<?xml version="1.0" encoding="utf-8"?>
<root><element/></root>

i.e. the zero-length string is removed and only an empty element is loaded. I believe this happens in loadXML(). Is there any way to convince DOMDocument::loadXML() not to convert a zero-length string into an empty element? Ideally the DOM would contain a text node with a zero-length string as the element's child.

The solution needs to stay within PHP DOM because of how the loaded data is processed afterwards.

Vladimir Bashkirtsev
  • By the way what is your $xml? – fortune Jun 07 '14 at 14:09
  • possible duplicate of [How to create an XML text node with an empty string value (in Java)](http://stackoverflow.com/questions/3884876/how-to-create-an-xml-text-node-with-an-empty-string-value-in-java). An empty text node (zero-length string) is not a text node, see http://stackoverflow.com/a/3885737/2044940 – CodeManX Jun 07 '14 at 14:13
  • Arbitrary xhtml content. I found that totally valid element of xhtml is not rendered properly if document served as Content-Type: text/html . It must be expressed as for browsers to understand it correctly. So my idea is to load xhtml before delivery to find if mark up has empty elements and fix it accordingly. Changing Content-Type on server is out of question as text/html is used by browser if file is saved locally. – Vladimir Bashkirtsev Jun 07 '14 at 14:16
  • @CoDEmanX Not a dupe: I am well aware that XML parsers see no difference between no string and empty string. My question is about how to note this difference in DOM tree on load by having empty TextNode - as in my example above. I can create empty child TextNode. Can I do the same on load? – Vladimir Bashkirtsev Jun 07 '14 at 14:23
  • XML and XHTML is not compatible with HTML, and XSLT Processors are really not to be used for HTML. I doubt PHP has a hidden option to disable self-closing elements to be used, so you have to either traverse the entire DOM tree and add empty text nodes (not sure how), or use a XSLT processor to expand the self-closing elements (example posted in own answer). – CodeManX Jun 07 '14 at 20:36
  • I guess original problem stems from the fact that XML and XHTML is not compatible with HTML. We using XHTML on the promise that it does the same as HTML. However it is only the case when Content-Type is application/xhtml+xml and results not guaranteed with Content-Type is text/html . Unfortunately browsers default to text/html if Content-Type is not specified (say if it is local file or remote server still serves text/html header regardless). difference is the first one we hit and looks like we now do a patchwork. May be we should look in different direction alltogether. – Vladimir Bashkirtsev Jun 08 '14 at 06:52

3 Answers


The problem with distinguishing between the two is that when DOMDocument loads the serialized XML document, it simply follows the specs.

By the book, <element></element> contains no empty text node, which is what others have already commented as well.

However, DOMDocument is perfectly fine with it if you insert an empty text node there yourself. You can then easily distinguish between a self-closing tag (no children) and an empty element (one child, an empty text node).
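A minimal sketch of that distinction, using only the standard DOM extension. The element names a and b are made up for illustration; the last check shows the round-trip loss the question describes.

```php
<?php
// Build one element without children and one with an empty text node.
$doc  = new DOMDocument('1.0', 'utf-8');
$root = $doc->appendChild($doc->createElement('root'));

$selfClosing = $root->appendChild($doc->createElement('a'));
$empty       = $root->appendChild($doc->createElement('b'));
$empty->appendChild($doc->createTextNode(''));

// In the built tree the two can be told apart by their children.
var_dump($selfClosing->hasChildNodes());
var_dump($empty->hasChildNodes());

// After serializing and loading again, the parser creates no text node for <b></b>.
$reloaded = new DOMDocument();
$reloaded->loadXML($doc->saveXML());
$reloadedB = $reloaded->getElementsByTagName('b')->item(0);
var_dump($reloadedB->hasChildNodes());
```

So the marker survives inside a tree you build yourself, but it has to be re-created whenever the document is parsed again.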

So how do you insert those empty text nodes? One way is the XMLReader-based XMLReaderIterator library, specifically its DOMReadingIteration, which builds up the document while offering each current XMLReader node for interaction:

// Assumes the XMLReaderIterator library (which provides DOMReadingIteration)
// is already loaded and that $xml holds the XML string from the question.
$reader = new XMLReader();
$reader->XML($xml);

$doc = new DOMDocument();

$iterator = new DOMReadingIteration($doc, $reader);

foreach ($iterator as $index => $value) {
    // Preserve empty elements as non-self-closing by giving them a single
    // child text node with zero-length text
    if ($iterator->isEndElementOfEmptyElement()) {
        $iterator->getLastNode()->appendChild(new DOMText(''));
    }
}

echo $doc->saveXML();

For your input:

<?xml version="1.0" encoding="utf-8"?>
<root><element></element></root>

it gives this output:

<?xml version="1.0"?>
<root><element></element></root>

No strings attached: a properly built DOMDocument. The example is from examples/read-into-dom.php and shows that this is no problem when you load the document via XMLReader and handle that one special case.
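For comparison, here is a rough sketch of the same idea using only the built-in XMLReader, without the library. It is not the library's implementation: the function name loadPreservingEmptyElements is hypothetical, and the sketch handles only elements, attributes, text and CDATA, ignoring namespaces, comments and processing instructions. It relies on XMLReader's isEmptyElement property, which is true for <element/> but not for <element></element>.

```php
<?php
// Hypothetical helper: builds a DOMDocument from a string via XMLReader and
// marks elements written as <x></x> with an empty text node child.
function loadPreservingEmptyElements(string $xml): DOMDocument
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $doc    = new DOMDocument();
    $parent = $doc; // node that currently receives new children

    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // Read this before visiting attributes, it refers to the element.
                $selfClosing = $reader->isEmptyElement;
                $node = $doc->createElement($reader->name);
                $parent->appendChild($node);
                while ($reader->moveToNextAttribute()) {
                    $node->setAttribute($reader->name, $reader->value);
                }
                $reader->moveToElement();
                if (!$selfClosing) {
                    $parent = $node; // an END_ELEMENT event will follow
                }
                break;

            case XMLReader::END_ELEMENT:
                // Open/close pair without content: keep it distinguishable.
                if (!$parent->hasChildNodes()) {
                    $parent->appendChild($doc->createTextNode(''));
                }
                $parent = $parent->parentNode;
                break;

            case XMLReader::TEXT:
            case XMLReader::CDATA:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                if ($parent !== $doc) { // text is not allowed at document level
                    $parent->appendChild($doc->createTextNode($reader->value));
                }
                break;
        }
    }
    $reader->close();

    return $doc;
}

$doc = loadPreservingEmptyElements('<root><element></element><other/></root>');
echo $doc->saveXML();
```

After loading, hasChildNodes() on an element tells you whether it was written as an open/close pair or as a self-closing tag in the source.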

hakre
  • Exactly what I was looking for! Tried your solution and it works great! I should commend you on your fine PHP skills - reading your source was a pleasure! BTW: is there any way to automatically pull git repo into svn? :) Joking! That's another question... Thank you! – Vladimir Bashkirtsev Jun 09 '14 at 10:18
  • @user3713667: This git repo is hosted by github which supports SVN: Subversion checkout URL: https://github.com/hakre/XMLReaderIterator – hakre Jun 13 '14 at 15:11

There is no difference for the XML parser on load. The DOM is exactly the same.

If you load/save an XML format that has a problem with empty tags, you can use an option to avoid the empty tags on save:

$dom = new DOMDocument();
$dom->appendChild($dom->createElement('foo'));

echo $dom->saveXml();
echo "\n";
echo $dom->saveXml(NULL, LIBXML_NOEMPTYTAG);

Output:

<?xml version="1.0"?>
<foo/>

<?xml version="1.0"?>
<foo></foo>
ThW
  • I have no problem with the saving part. Appending an empty TextNode as a child of the element prevents it from being saved in self-closing form. My issue is that I cannot get this empty TextNode back after I load the XML into the DOM again. Or I would like to have at least some property which indicates whether the element was self-closed or empty in the original XML document. – Vladimir Bashkirtsev Jun 07 '14 at 14:28

You can trick XSLT processors into not emitting self-closing elements by adding an xsl:value-of whose select expression is the empty string ''.

Input:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <foo>
    <bar some="value"></bar>
    <self-closing attr="foobar" val="3.5"/>
  </foo>
  <goo>
    <gle>
      <nope/>
    </gle>
  </goo>
</root>

Stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*[not(node())]">
        <xsl:copy>
            <xsl:for-each select="@*">
                <xsl:attribute name="{name()}">
                    <xsl:value-of select="."/>
                </xsl:attribute>
            </xsl:for-each>
            <xsl:value-of select="''"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Output:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <foo>
    <bar some="value"></bar>
    <self-closing attr="foobar" val="3.5"></self-closing>
  </foo>
  <goo>
    <gle>
      <nope></nope>
    </gle>
  </goo>
</root>

To solve this in PHP without the use of an XSLT processor, I can only think of adding empty text nodes to all elements with no children (like you do in the creation of the XML).
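A minimal sketch of that idea with DOMXPath, reusing the same match expression as the stylesheet. The sample document is made up for illustration. Note that, like the stylesheet, it treats <a/> and <b></b> alike, so it expands every childless element rather than detecting which form the source used.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><a/><b></b><c>text</c></root>');

$xpath = new DOMXPath($doc);

// Select every element without child nodes and give it an empty text node,
// so that it is no longer serialized as a self-closing tag.
foreach ($xpath->query('//*[not(node())]') as $element) {
    $element->appendChild($doc->createTextNode(''));
}

echo $doc->saveXML();
```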

CodeManX
  • It appears that core issue is to detect if an element is self-closed or empty element in XML stream. Looking through libxml source it is clear that it does not distinguish between the two and so PHP DOM (being direct user of libxml) is unable to do it. Other xml parsers may but we limited to PHP DOM. So it appears we need to resort to some ugly regexps to pick up self-closed . Luckily we only need to detect it, not to transform. – Vladimir Bashkirtsev Jun 08 '14 at 07:06