Parsing a subnode with PyXB

Question

Using PyXB, I'd like to serialize a sub node and then be able to parse it back. The naive way isn't working, because the sub node is not a valid root element according to the schema.

My schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="root" type="Root"/>

  <xsd:complexType name="Root">
    <xsd:sequence>
      <xsd:element name="item" maxOccurs="unbounded" type="Item"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="Item">
    <xsd:sequence>
      <xsd:element name="val"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

And sample XML:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <item>
        <val>1</val>
    </item>
    <item>
        <val>2</val>
    </item>
    <item>
        <val>3</val>
    </item>
</root>

I need to be able to serialize a specific item and then load it back. Something like this:

>>> root = CreateFromDocument(sample)
# locate a sub node to serialize
>>> root.item[1].toxml()
'<?xml version="1.0" ?><item><val>2</val></item>'
# load the sub node, getting an Item back
>>> sub_node = CreateFromDocument('<?xml version="1.0" ?><item><val>2</val></item>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "binding.py", line 63, in CreateFromDocument
    instance = handler.rootObject()
  File "pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
pyxb.exceptions_.UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x7f30ba4ac550>

# or, perhaps, some kind of unique identifier:
>>> root.item[1].hypothetical_unique_identifier()
'//root/item/1'
>>> sub_node = CreateFromDocument(sample).find_node('//root/item/1')
<binding.Item object at 0x7f30ba4a5d50>

This of course doesn't work, because item can't be the root node according to the schema. Is there a way to parse just a subtree and get an Item back instead?

Alternatively, is there some way to uniquely identify a sub node so I can find it later?

score 0 · Answer 1 · answered Oct 09 '15 at 20:18

PyXB can't parse a document whose root element isn't a global element, because the validation automaton states for non-global elements aren't start states.
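
One possible workaround, which is not part of the answer itself, is to wrap the serialized fragment in the schema's global `root` element before parsing and then take the first `item` from the result. The sketch below only builds the wrapped document with the standard library. `wrap_fragment` is a hypothetical helper name, and handing its output to the generated `CreateFromDocument` is an assumption that has not been checked against PyXB here.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as '<?xml version="1.0" ?>'.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def wrap_fragment(fragment, root_tag='root'):
    """Return the fragment wrapped in a root element (hypothetical helper)."""
    # A declaration is only legal at the very start of a document, so drop it.
    body = _XML_DECL.sub('', fragment)
    return '<%s>%s</%s>' % (root_tag, body, root_tag)


fragment = '<?xml version="1.0" ?><item><val>2</val></item>'
wrapped = wrap_fragment(fragment)

# Sanity check with the standard library that the result is well-formed.
tree = ET.fromstring(wrapped)
print(tree.tag, [child.tag for child in tree])
```

With the generated bindings, the idea would then be something like `CreateFromDocument(wrap_fragment(xml)).item[0]`, since a `root` containing a single `item` is valid under the schema in the question.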

Though I'd originally thought of supporting something like XPath, it never got implemented, nor is there a standard unique identifier that carries structural information. If you need to mark a member element so you can remove it and later put it back where it came from, you could just assign additional properties to the object and use them at the application level, e.g.:

e = root.item[1]
e.__mytag = '//root/item/1'

You could then write a function that walks the object tree searching for a match. Such an attribute would, of course, remain associated only with that instance, so a different object subsequently assigned to root.item[1] would not automatically inherit it.
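
A minimal sketch of such a walker follows. It uses a stand-in `Node` class rather than real PyXB bindings, so `Node`, `find_tagged`, and the `get_children` callback are all hypothetical names; how children are enumerated for actual binding instances is left to the caller. It also uses a single-underscore attribute, `_mytag`, because double-underscore names are mangled when accessed inside a class body.

```python
import collections


class Node(object):
    """Stand-in for a binding instance; not a PyXB class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def find_tagged(root, tag, get_children, attr='_mytag'):
    """Breadth-first search for the first object whose `attr` equals `tag`."""
    queue = collections.deque([root])
    while queue:
        node = queue.popleft()
        # Objects that were never tagged simply have no such attribute.
        if getattr(node, attr, None) == tag:
            return node
        queue.extend(get_children(node))
    return None


items = [Node('item'), Node('item'), Node('item')]
root = Node('root', items)

# Tag one instance at the application level, as in the answer above.
items[1]._mytag = '//root/item/1'

found = find_tagged(root, '//root/item/1', lambda n: n.children)
print(found is items[1])
```

Because the tag lives on the Python object only, it disappears as soon as the document is serialized and parsed again.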

I need to mark it in a persistent way, so when the document is parsed later the node can be identified. — Gavin Wahl, Oct 09 '15 at 20:53
Then you need to do that at the XML level, probably by adding an attribute to the element, since anything done solely at the PyXB object instance level won't be retained in the document expression. One way would be to create a new namespace, declare attributes that carry the information you need, and use the (apparently undocumented) [_setAttribute method](http://pyxb.sourceforge.net/api/pyxb.binding.basis.complexTypeDefinition-class.html) to add it to the instance. This assumes the underlying schema permits wildcard attributes; if not, it's a much harder problem to solve. — pabigot, Oct 10 '15 at 11:05
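
To illustrate the idea of a marker that survives serialization, here is a sketch at the XML level using the standard library's `xml.etree.ElementTree` rather than PyXB's `_setAttribute`. The namespace URI, the `mark` prefix, the attribute name, and the value `node-42` are all made up for illustration. Note that the schema shown in the question declares no attribute wildcard, so a document marked this way would presumably not validate against it without a schema change.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and attribute name, chosen for illustration only.
MARK_NS = 'urn:example:marker'
MARK_ATTR = '{%s}id' % MARK_NS

sample = """<root>
    <item><val>1</val></item>
    <item><val>2</val></item>
    <item><val>3</val></item>
</root>"""

# Ask ElementTree to use a readable prefix when serializing the namespace.
ET.register_namespace('mark', MARK_NS)

root = ET.fromstring(sample)
# Mark the second item with an identifier stored in the document itself.
root.findall('item')[1].set(MARK_ATTR, 'node-42')

# Serialize and parse again: the marker survives the round trip.
serialized = ET.tostring(root, encoding='unicode')
reparsed = ET.fromstring(serialized)

matches = [el for el in reparsed.iter('item') if el.get(MARK_ATTR) == 'node-42']
print(matches[0].find('val').text)
```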

score 0 · Answer 2 · answered Oct 09 '15 at 20:55

The way I ended up doing this was by using the starting line and column number of the element to identify it.

I added this mixin to all my elements:

class IdentifierMixin(object):
    """
    Adds an identifier property unique to this node that can be used to locate
    it in the document later.
    """
    @property
    def identifier(self):
        return '%s-%s' % (self._location().lineNumber, self._location().columnNumber)

And then used this function to look up nodes later:

import collections

def find_by_identifier(root, identifier):
    # BFS over the tree because usually the identifier we're looking for will
    # be close to the root.
    queue = collections.deque([root])
    while queue:
        node = queue.popleft()
        if node.identifier == identifier:
            return node
        queue.extend(node.content())
    # No node with this identifier was found.
    return None

That'd work; just be aware that it's only unique for that specific document. If the schema is not deterministic the textual location of a given element may be different in different results from invoking `toxml` on the same Python binding instance. — pabigot, Oct 10 '15 at 11:09
