
The following working code reads my XML file, which contains lots of empty elements, applies two changes, and saves it under a different name. But it also changes empty elements like <element></element> to self-closing tags like <element />, which is unwanted.
How can I save it without self-closing tags? In other words, how do I tell XML::LibXML to write empty elements as separate start and end tags? The original file is produced by a commercial application that uses the empty-element style, so I want to preserve that.

#! /usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

my $filename = 'out.xml';
my $dom = XML::LibXML->load_xml(location => $filename);
my $query = '//scalar[contains(@name, "partitionsNo")]/value';
for my $i ($dom->findnodes($query)) {
    $i->removeChildNodes();
    $i->appendText('16');
}

open my $out, '>', 'out2.xml' or die "Cannot open out2.xml: $!";
binmode $out;
$dom->toFH($out);
close $out or die "Cannot close out2.xml: $!";
# out2.xml now has self-closing tags wherever the original
# had empty elements
Petr Matousu
  • Self closing tags is how XML should look if there is no content. It is saving it as valid XML. I don't believe you will be able to or even should override that as it would make it invalid XML – Deckerz Aug 15 '17 at 10:38
  • @Deckerz of course `` is _not_ invalid XML. – CBroe Aug 15 '17 at 10:40
  • @Deckerz it is not invalid of course as CBroe stated and I need to maintain original style of document as it is produced by an application and to be able to trace changes back. – Petr Matousu Aug 15 '17 at 10:42
  • "to be able to diff files without false hits caused by swapping empty elements to self-closing tags" - You're solving the wrong problem. Stop trying to treat XML as not XML. If you want to diff it, use an XML aware diff tool. – Quentin Aug 15 '17 at 10:43
  • @Quentin , as you could guess the diff is not what is important to me. But thanks for the comment, I have edited my question. – Petr Matousu Aug 15 '17 at 10:45
  • Since, before you edited the question, diffs were the only reason you gave for desiring this: No, I couldn't have guessed that. – Quentin Aug 15 '17 at 10:48
  • you would have to manually edit the string output of XML. as `` is not valid and the fact a commercial application uses that is shocking. – Deckerz Aug 15 '17 at 11:04
  • @Deckerz: *`` is not valid* You keep saying that, but it's simply not true. `` is just as valid as ``. The latter is just a shortcut. – Dave Cross Aug 15 '17 at 11:11
  • @DaveCross according to XML spec it is not valid. – Deckerz Aug 15 '17 at 11:12
  • @Deckerz: So why does [the XML spec](https://www.w3.org/TR/REC-xml/#sec-starttags) give `` as an example of an empty element? – Dave Cross Aug 15 '17 at 11:15
  • @Deckerz [according to the spec](https://www.w3.org/TR/REC-xml/#sec-starttags), the empty tags `` _should_ be used. Quote: _Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY._ – simbabque Aug 15 '17 at 11:16
  • @Deckerz ... Furthermore the [_for interoperability_](https://www.w3.org/TR/REC-xml/#dt-interop) means _Marks a sentence describing a non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879._ That makes it pretty clear that it's a recommendation, but not something you HAVE TO do. Only SHOULD. Hence, `` is just as valid as ``. – simbabque Aug 15 '17 at 11:16
  • @Deckerz there's also another very popular one for XHTML, which is an XML dialect: ``. – simbabque Aug 15 '17 at 11:17
  • @simbabque go to any XML validator and try both. you will notice an error complaining that the XML must be well formed when you try `` – Deckerz Aug 15 '17 at 11:26
  • @Deckerz that validators do not implement a specification correctly is nothing new. We're talking about the spec, not the real world. You'll always see problems in various programs regarding standard-conformity. Browsers in particular are the most prominent ones. The fact that libxml already parses OP's document tells us that it cannot be that malformed. libxml is after all the most common parsing library for XML. – simbabque Aug 15 '17 at 11:32
  • @simbabque just because it parses doesn't mean it's valid, most XML parsers will handle basic error corrections. Which is why LIBXML makes the tag self enclosed. – Deckerz Aug 15 '17 at 11:44
  • @Deckerz: *go to any XML validator and try both* - I just went to [the first Google result for "XML Validator"](https://www.w3schools.com/xml/xml_validator.asp). It accepts `` without complaint. Honestly, it's probably time you stopped digging :-) – Dave Cross Aug 15 '17 at 11:45
  • @DaveCross the w3schools one, seriously? :-) the official w3 one uses libxml2 and also supports it, but it can't link to a result directly. – simbabque Aug 15 '17 at 11:48
  • I used to have a piece of software that read wsdl files but didn't understand namespaces in front of the tag names. My solution was a text processing script I wrote in Perl to move those into individual namespace attributes on every single element. Then it would work. This question is totally valid, though it could do with a real why, and a [mcve]. I find it an intriguing problem actually. – simbabque Aug 15 '17 at 11:50
  • @simbabque that checks for XHTML not XML unless i have the wrong link. Also note if you just test `` on it own it will be valid because it is the root element. if you test it all inside a root element it will fail. – Deckerz Aug 15 '17 at 11:54
  • @Deckerz try this one: https://validator.w3.org/check?fragment=%3C%3Fxml+version%3D%221.0%22%3F%3E%0D%0A%3Croot%3E%3Cfoo%3E%3C%2Ffoo%3E%3Cbar+%2F%3E%3C%2Froot%3E%0D%0A&charset=%28detect+automatically%29&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices – simbabque Aug 15 '17 at 11:58
  • @simbabque: Like I said, I just went to the first result. And, yes, I felt dirty :-) And the one at validator.w3.org say it's for HTML/XHTML. Can it be trusted for XML? – Dave Cross Aug 15 '17 at 12:05
  • @Deckerz: I've now tried five online validators and none of them object to ``. Looks like you might be using a broken one. Care to tell us which ones you're using? – Dave Cross Aug 15 '17 at 12:09
  • it would be a bit of a dirty hack, but you could use a regex to correct the empty tags into the way you want? – Chris Turner Aug 15 '17 at 13:05
  • @ChrisTurner: I want to avoid that. – Petr Matousu Aug 15 '17 at 13:07
  • Actually the XML file I am working with is a Ricardo Vectis session file (Vectis is a CFD app for automotive engines). I have found they use both empty elements and self-closed elements in one file. So my need is rather not to change the file where the text and attributes remain untouched. My goal was not to trigger such a holy war. But thank you all for your time and input. Maybe some of you who feel the question is valid would give it votes. I tested that the application reads and works with libxml-modified files, and so far it seems OK even with self-closing tags. Thank you all. – Petr Matousu Aug 15 '17 at 13:11

2 Answers


Unfortunately, XML::LibXML doesn't support libxml2's xmlsave module which has a flag to save without empty tags.

As a workaround you can add an empty text node to empty elements:

for my $node ($doc->findnodes('//*[not(node())]')) {
    # Note that appendText doesn't work.
    $node->appendChild($doc->createTextNode(''));
}

This is a bit costly for large documents, but I'm not aware of a better solution.
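
As a rough, self-contained sketch of the workaround above: the sample string and the variable `$xml` are made up for illustration, and a real script would load the file with `location => ...` as in the question. The idea is that an element with an empty text child is no longer childless, so the serializer should write it with separate start and end tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input; a real script would use load_xml(location => ...).
my $doc = XML::LibXML->load_xml(string => '<root><a></a><b/><c>text</c></root>');

# Elements with no child nodes at all are the ones serialized as <e/>.
for my $node ($doc->findnodes('//*[not(node())]')) {
    # An empty text child leaves the content unchanged but makes the
    # serializer write separate start and end tags.
    $node->appendChild($doc->createTextNode(''));
}

# Serialize only the root element (no XML declaration) for inspection.
my $xml = $doc->documentElement->toString;
print "$xml\n";
```

Note that this turns every childless element into the start/end form, including ones that were self-closing in the input, so it does not preserve a file that mixes both styles.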

That said, the fragments <foo></foo> and <foo/> are both well-formed and semantically equivalent. Any XML parser or application that treats such fragments differently is buggy.


Note that some people believe the XML spec recommends using self-closing tags, but that's not exactly true. The XML spec says:

Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY.

This means elements that are declared EMPTY in a DTD. For other elements, or if no DTD is present, the XML standard advises not to use self-closing tags ("and should only be used"). But this is only a non-binding recommendation for interoperability.

ikegami
nwellnhof
  • Thank you for your answer. After all, I will probably not use it, as the application does not complain. In light of that comment war under the question, I am glad you pointed me to libxml2's xmlsave and its flag, which is not implemented in Perl's XML::LibXML. – Petr Matousu Aug 15 '17 at 13:38
  • @PetrMatousu have you tried any other XML modules in Perl to do what you are doing? I _think_ there are some that do not depend on XML::LibXML and they might already behave the way you want. – simbabque Aug 15 '17 at 14:50
  • @simbabque I was searching among others as well, and it looks like LibXML does well what I need. Some others are discouraged, like SimpleXML, and I need to avoid any data loss. I must do more testing of modified files. And XML::LibXML is nicely documented. Thank you for pointing that out. – Petr Matousu Aug 15 '17 at 16:53

There is a package variable

$XML::LibXML::setTagCompression

Setting it to a true value forces all empty tags to be printed as <e></e>, while a false value forces <e/>.

See the SERIALIZATION section of the XML::LibXML::Parser documentation.
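
A minimal sketch of how the variable might be used, assuming it is honored by `toString` as that documentation describes. The sample document and the names `$expanded` and `$compact` are made up for illustration. Using `local` keeps the setting from leaking into other code that serializes XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input mixing both styles of empty element.
my $dom = XML::LibXML->load_xml(string => '<root><a></a><b/></root>');

# Serialize with the flag set; local restores the old value at block exit.
my $expanded = do {
    local $XML::LibXML::setTagCompression = 1;
    $dom->toString;
};

# Serialize again with the flag back at its default (unset) value.
my $compact = $dom->toString;

print $expanded, $compact;
```

The setting is global to the serialization rather than per element, so it cannot reproduce a file in which some empty elements are written as `<e></e>` and others as `<e/>`. Whether `toFH` and `toFile` honor the variable in the same way is an assumption worth checking against the installed version.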

choroba