
I'm trying to read XML with ElementTree and write the result back to disk. My long-term goal is to prettify the XML this way. However, in my naive approach, ElementTree eats the namespace declarations in the document, and I don't understand why. Here is an example:

test.xsd

<?xml version='1.0' encoding='UTF-8'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
    xmlns='sdformat/pose' targetNamespace='sdformat/pose'
    xmlns:pose='sdformat/pose'
    xmlns:types='http://sdformat.org/schemas/types.xsd'>

<xs:import namespace='sdformat/pose' schemaLocation='./pose.xsd'/>

<xs:element name='pose' type='poseType' />

<xs:simpleType name='string'><xs:restriction base='xs:string' /></xs:simpleType>
<xs:simpleType name='pose'><xs:restriction base='types:pose' /></xs:simpleType>

<xs:complexType name='poseType'>
    <xs:simpleContent>
      <xs:extension base="pose">
    <xs:attribute name='relative_to' type='string' use='optional' default=''>
    </xs:attribute>

      </xs:extension>
    </xs:simpleContent>
</xs:complexType>


</xs:schema>

test.py

from xml.etree import ElementTree

ElementTree.register_namespace("types", "http://sdformat.org/schemas/types.xsd")
ElementTree.register_namespace("pose", "sdformat/pose")
ElementTree.register_namespace("xs", "http://www.w3.org/2001/XMLSchema")

tree = ElementTree.parse("test.xsd")
tree.write("test_out.xsd")

This produces test_out.xsd:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="sdformat/pose">

<xs:import namespace="sdformat/pose" schemaLocation="./pose.xsd" />

<xs:element name="pose" type="poseType" />

<xs:simpleType name="string"><xs:restriction base="xs:string" /></xs:simpleType>
<xs:simpleType name="pose"><xs:restriction base="types:pose" /></xs:simpleType>

<xs:complexType name="poseType">
    <xs:simpleContent>
      <xs:extension base="pose">
    <xs:attribute name="relative_to" type="string" use="optional" default="">
    </xs:attribute>

      </xs:extension>
    </xs:simpleContent>
</xs:complexType>


</xs:schema>

Notice how test_out.xsd is missing every namespace declaration from test.xsd except the one for xs. I would expect the two files to be identical. I validated test.xsd to confirm that it is valid XML; it validates except for my choice of namespace URI, which I think shouldn't matter.


Update:

Based on mzjn's comment, I realized that a declaration is only dropped when its prefix appears solely in attribute values. With this in mind, I can manually add the namespaces like so:

from xml.etree import ElementTree

namespaces = {
    "types": "http://sdformat.org/schemas/types.xsd",
    "pose": "sdformat/pose",
    "xs": "http://www.w3.org/2001/XMLSchema"
}

for prefix, ns in namespaces.items():
    ElementTree.register_namespace(prefix, ns)

tree = ElementTree.parse("test.xsd")
root = tree.getroot()

queue = [root]
while queue:
    element: ElementTree.Element = queue.pop()
    # iterate over a copy: root.attrib may grow while the root itself is visited
    for value in list(element.attrib.values()):
        try:
            prefix, _ = value.split(":")
        except ValueError:
            # no single "prefix:name" pair, nothing to do
            continue
        if prefix == "xs":
            continue  # already declared, because element names use it
        if prefix in namespaces:
            root.attrib[f"xmlns:{prefix}"] = namespaces[prefix]

    # visit the children as well
    queue.extend(element)

tree.write("test_out.xsd")

While this solves the problem, it is quite an ugly solution. I also still don't understand why this happens in the first place, so it doesn't answer the question.

FirefoxMetzger
  • One thing that particularly confuses me is that the XMLSchema namespace `xs` is preserved, but all the other ones are dropped. – FirefoxMetzger Jul 26 '21 at 04:36
  • ElementTree removes declarations for namespaces that are not used in the XML document. The namespace associated with the `xs` prefix is actually used. See https://stackoverflow.com/q/45990761/407651 – mzjn Jul 26 '21 at 04:49
  • @mzjn But so is `types` (see, for example, the second `simpleType`). – FirefoxMetzger Jul 26 '21 at 04:54
  • @mzjn Your comment made me test using the `types` namespace in an attribute name, and once that happens the namespace is added correctly ... Why does etree ignore a namespace inside the attribute's value? – FirefoxMetzger Jul 26 '21 at 05:08
  • An attribute value such as `types:pose` signals that it is associated with a namespace, but in most cases such a value does not have any special significance. See http://www.rpbourret.com/xml/NamespacesFAQ.htm#names_5. – mzjn Jul 26 '21 at 05:22
  • @mzjn that answers my question. Thank you. – FirefoxMetzger Jul 26 '21 at 05:36

1 Answer


There is a valid reason for this behaviour, but it requires a good understanding of XML Schema concepts.

First, some important facts:

  • Your XML document is not just any old XML document. It is an XSD.
  • An XSD is itself described by a schema (see the 'schema for schema').
  • The attribute xs:restriction/@base is not an xs:string. Its type is xs:QName.

Based on the above facts, we can assert the following:

  • If test.xsd is parsed as an XML document, but without knowledge of the 'schema for schema', then the value of the base attribute will be treated as a string (technically, as PCDATA).
  • If test.xsd is parsed using a validating XML parser, with the 'schema for schema' as the XSD, then the value of the base attribute will be parsed as xs:QName.
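The first case can be observed directly. The following is a minimal sketch, assuming a cut-down inline copy of test.xsd (the `XSD` string is made up for illustration). It shows that the parser expands element names into `{uri}local` form but hands back the value of base as a plain string:

```python
from xml.etree import ElementTree

# Cut-down stand-in for test.xsd (illustrative only)
XSD = """<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
    xmlns:types='http://sdformat.org/schemas/types.xsd'>
  <xs:simpleType name='pose'><xs:restriction base='types:pose'/></xs:simpleType>
</xs:schema>"""

root = ElementTree.fromstring(XSD)
restriction = root.find(".//{http://www.w3.org/2001/XMLSchema}restriction")

# Element names are resolved against the namespace declarations ...
print(restriction.tag)
# ... but attribute values are kept verbatim, prefix and all
print(repr(restriction.get("base")))
```

Once parsing is done, nothing in the tree links the text `types:pose` to the `types` namespace URI.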

When ElementTree writes the output XML, its behaviour should depend on the data type of base. If base is a QName then ElementTree should detect that it is using the namespace prefix 'types' and it should emit the corresponding namespace declaration.
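This difference can be reproduced by hand, because ElementTree provides a `QName` wrapper that can be used for attribute values. The sketch below builds a single element rather than parsing test.xsd, and the constants `XS` and `TYPES` are names chosen here for illustration. It serializes the same attribute once as a plain string and once as a `QName`:

```python
from xml.etree import ElementTree

XS = "http://www.w3.org/2001/XMLSchema"
TYPES = "http://sdformat.org/schemas/types.xsd"
ElementTree.register_namespace("xs", XS)
ElementTree.register_namespace("types", TYPES)

element = ElementTree.Element(f"{{{XS}}}restriction")

# Plain string value: the serializer has no reason to declare "types"
element.set("base", "types:pose")
plain = ElementTree.tostring(element, encoding="unicode")

# QName value: the serializer resolves the prefix and declares the namespace
element.set("base", ElementTree.QName(TYPES, "pose"))
qualified = ElementTree.tostring(element, encoding="unicode")

print(plain)
print(qualified)
```

In the second output the `types` declaration should appear and the value should still be written as `types:pose`. This only helps when you know which attributes hold QNames, which is exactly the knowledge the schema would provide.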

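If you do not supply the 'schema for schema' when parsing test.xsd, then ElementTree is off the hook, because it cannot possibly know that base is supposed to be interpreted as a QName. It sees types:pose as an ordinary string, finds no element or attribute name that uses the types namespace, and therefore drops the declaration.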
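If the goal is simply to keep the declarations, one option is to record them while parsing and put back the ones the serializer would drop. This is a rough sketch rather than a definitive fix: the helper names `parse_with_declarations` and `redeclare_unused` are made up here, and it assumes that every prefix is bound to a single URI and that hoisting all declarations to the root element is acceptable. It relies on the `start-ns` event of `iterparse`, which reports each `(prefix, uri)` declaration:

```python
import io
from xml.etree import ElementTree


def parse_with_declarations(source):
    """Parse source and also return every (prefix, uri) declaration seen."""
    declared = {}
    root = None
    for event, payload in ElementTree.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload  # prefix is "" for the default namespace
            declared.setdefault(prefix, uri)
        elif root is None:
            root = payload  # the first "start" event belongs to the root
    return root, declared


def redeclare_unused(root, declared):
    """Re-add declarations that the serializer would otherwise drop."""
    # Namespaces used by element or attribute names are emitted automatically
    used = set()
    for element in root.iter():
        for name in (element.tag, *element.keys()):
            if isinstance(name, str) and name.startswith("{"):
                used.add(name[1:].split("}", 1)[0])
    for prefix, uri in declared.items():
        if prefix:
            # keep the original prefix instead of a generated ns0, ns1, ...
            ElementTree.register_namespace(prefix, uri)
        if uri not in used:
            # adding a used namespace again would duplicate its xmlns attribute
            root.set(f"xmlns:{prefix}" if prefix else "xmlns", uri)


# Cut-down stand-in for test.xsd (illustrative only)
SOURCE = b"""<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
    xmlns='sdformat/pose'
    xmlns:types='http://sdformat.org/schemas/types.xsd'>
  <xs:simpleType name='pose'><xs:restriction base='types:pose'/></xs:simpleType>
</xs:schema>"""

root, declared = parse_with_declarations(io.BytesIO(SOURCE))
redeclare_unused(root, declared)
output = ElementTree.tostring(root, encoding="unicode")
print(output)
```

Note that `register_namespace` changes global state and rejects prefixes of the form `ns0`, `ns1`, and so on, so documents that use such prefixes would need extra handling.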

kimbert