7

I use the xml library in Python3.5 for reading and writing an xml-file. I don't modify the file. Just open and write. But the library modifes the file.

  1. Why is it modified?
  2. How can I prevent this? e.g. I just want to replace specific tag or it's value in a quite complex xml-file without loosing any other informations.

This is the example file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
    <title>Der Eisbär</title>
    <ids>
        <entry>
            <key>tmdb</key>
            <value xsi:type="xs:int" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">9321</value>
        </entry>
        <entry>
            <key>imdb</key>
            <value xsi:type="xs:string" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">tt0167132</value>
        </entry>
    </ids>
</movie>

This is the code

import xml.etree.ElementTree as ET
tree = ET.parse('x.nfo')
tree.write('y.nfo', encoding='utf-8')

And the xml-file becomes this

<movie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title>Der Eisbär</title>
    <ids>
        <entry>
            <key>tmdb</key>
            <value xsi:type="xs:int">9321</value>
        </entry>
        <entry>
            <key>imdb</key>
            <value xsi:type="xs:string">tt0167132</value>
        </entry>
    </ids>
</movie>
  • Line 1 is gone.
  • The <movie>-tag in line 2 has attributes now.
  • The <value>-tag in line 7 and 11 now has less attributes.
buhtz
  • 10,774
  • 18
  • 76
  • 149
  • 1
    In general, the short names (and where they are specified) for XML namespaces cannot be expected to be stable. But why aren't you using `lxml` anyway? – o11c Sep 01 '17 at 04:02
  • 1
    `lxml` manages to preserve the namespaces by default, although you still have to pass a flag to get the XML declaration up top. – o11c Sep 01 '17 at 04:06
  • @o11c You mean a python package `lxml`? I didn't notice it. I was just using `xml` as search term in the python doc and found `ElementTree`. – buhtz Sep 01 '17 at 07:04
  • @o11c `lxml` doesn't help, too. It does some transformations of the code, too. – buhtz Sep 01 '17 at 07:52
  • Well, *nobody* tries to preserve attribute order. So what `lxml` does is sort them. So even if they are changed on the *first* write, they will be consistent on all subsequent writes. – o11c Sep 01 '17 at 15:38

1 Answers1

6

Note that "xml package" and "the xml library" are ambiguous. There are several XML-related modules in the standard library: https://docs.python.org/3/library/xml.html.

Why is it modified?

ElementTree moves namespace declarations to the root element, and namespaces that aren't actually used in the document are removed.

Why does ElementTree do this? I don't know, but perhaps it is a way to make the implementation simpler.

How can I prevent this? e.g. I just want to replace specific tag or it's value in a quite complex xml-file without loosing any other informations.

I don't think there is a way to prevent this. The issue has been brought up before. Here are two very similar questions with no answers:

My suggestion is to use lxml instead of ElementTree. With lxml, the namespace declarations will remain where they occur in the original file.

Line 1 is gone.

That line is the XML declaration. It is recommended but not mandatory to have one.

If you always want an XML declaration, use xml_declaration=True in the write() method call.

mzjn
  • 48,958
  • 13
  • 128
  • 248