Python XML parsing removing empty CDATA nodes

Question

I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.

I simplified my code to test the process:

from xml.dom import minidom

filename = 'Test.xml'

dom = minidom.parse(filename)

with open( filename.replace('.xml','_Generated.xml'), mode='w', encoding='utf8' ) as fh:
    fh.write(dom.toxml())

Using this for the Test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<body>
    <![CDATA[]]>
</body>

This is what the Test_Generated.xml file contains:

<?xml version="1.0" ?><body>
    
</body>

A simple workaround is to open the document first, change every empty CDATA node to include some value before parsing, and then remove that value from the new file after generation. This seems like unnecessary work and execution time, though, as some of these documents contain tens of thousands of lines.
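For reference, the placeholder workaround described above can be sketched roughly as follows. This is only an illustration: the sentinel string and the function name are made up for the example, and it assumes the sentinel never occurs in the real document.

```python
from xml.dom import minidom

# Hypothetical sentinel; assumed not to occur anywhere in the real document.
PLACEHOLDER = "__EMPTY_CDATA__"


def round_trip_keeping_empty_cdata(xml_text):
    """Parse and re-serialize xml_text while keeping empty CDATA sections."""
    # Give every empty CDATA section some content so the parser reports it.
    marked = xml_text.replace("<![CDATA[]]>", f"<![CDATA[{PLACEHOLDER}]]>")
    dom = minidom.parseString(marked)
    # ... changes to the DOM would go here ...
    # Strip the sentinel again after serialization.
    return dom.toxml().replace(f"<![CDATA[{PLACEHOLDER}]]>", "<![CDATA[]]>")


print(round_trip_keeping_empty_cdata("<body><![CDATA[]]></body>"))
```

Both replacements are plain string passes over the whole text, so the cost grows linearly with document size.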

I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks. The callback that handles the data from the CDATA nodes is the character_data_handler_cdata method. The data that should be supplied to this method is already missing after parsing.

Anyone know what is going on with this?

Workaround to keep the CDATA sections: use lxml with the right configuration. https://stackoverflow.com/a/25813863/407651 — mzjn, Nov 18 '22 at 10:39
@mzjn I appreciate your suggestion but have been adding additional methods to the minidom library (javascript-like methods: querySelector, closest). With the time that I have invested I might look into doing something similar with the parser used for minidom. — R0NUT, Nov 18 '22 at 14:24

score 1 · Accepted Answer · answered Nov 18 '22 at 09:29

Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so % and &#37; and &#x25; and <![CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
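A small sketch of that equivalence, using the standard library's xml.etree.ElementTree instead of minidom simply because it exposes the text content directly. The element name a and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET

# Four ways of writing a percent sign as element content.
variants = [
    "<a>%</a>",
    "<a>&#37;</a>",
    "<a>&#x25;</a>",
    "<a><![CDATA[%]]></a>",
]

# Every variant yields the same text once parsed.
texts = [ET.fromstring(v).text for v in variants]
print(texts)

# An empty CDATA section cannot be told apart from an empty element.
empty_cdata = ET.fromstring("<a><![CDATA[]]></a>").text
empty_element = ET.fromstring("<a></a>").text
print(empty_cdata == empty_element)
```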

If your document design attaches significance to CDATA tags then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.

Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
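As one illustration of such an option, the expat parser in Python's standard library (the same parser minidom builds on) has separate handlers for the start and end of a CDATA section. The sketch below only records the events; the function name and event labels are invented for the example, and wiring this into a DOM builder is left out.

```python
import xml.parsers.expat


def cdata_events(xml_text):
    """Return the CDATA and character-data events expat reports for xml_text."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # The section boundaries are reported separately from the text inside them.
    parser.StartCdataSectionHandler = lambda: events.append("cdata-start")
    parser.EndCdataSectionHandler = lambda: events.append("cdata-end")
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_text, True)
    return events


print(cdata_events("<body><![CDATA[]]></body>"))
print(cdata_events("<body><![CDATA[%]]></body>"))
```

The boundary handlers are separate from the character data handler, so an application that listens to them can notice an empty section even when no text is delivered for it.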
