lxml, xi:include, and original file

Question

I'm using lxml to parse a file that contains xi:include elements, and I'm resolving the includes using xinclude().

Given an element, is there any way to identify the file and source line that the element originally appeared in?

For example:

from lxml import etree
doc = etree.parse('file.xml')
doc.xinclude()
xpath_expression = ...
elt = doc.xpath(xpath_expression)
# Print file name and source line of `elt` location

score 0 · Answer 1 · answered Aug 10 '13 at 03:53

The xinclude expansion adds an xml:base attribute to the top-level expanded element, and elt.base and elt.sourceline are updated for its child nodes as well, so:

print(elt.base, elt.sourceline)

will give you what you want.

If elt is not part of the xinclude expansion, elt.base points to the base document ('file.xml') and elt.sourceline is the line number in that file. (Note that for an element whose tag spans multiple lines, sourceline usually seems to point to the line where the tag ends rather than the line where it begins, just as validation error messages usually point to the closing tag where the error occurs.)
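To check the multi-line behaviour on your own lxml build, a small experiment along these lines may help. This is only a sketch: the file name multiline.xml, the sample markup, and the location() helper are made up for illustration, and the exact line reported for <item> is left unstated on purpose because it depends on how the underlying parser records positions.

```python
import tempfile
from pathlib import Path

from lxml import etree


def location(elt):
    """Return (base, sourceline) for an element (hypothetical helper)."""
    return elt.base, elt.sourceline


# Hypothetical sample: the <item> start tag is spread over lines 2 to 5.
SAMPLE = """<root>
  <item
      a="1"
      b="2"
  >text</item>
</root>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "multiline.xml"
    path.write_text(SAMPLE)
    doc = etree.parse(str(path))
    root = doc.getroot()
    # Read both values while the file still exists on disk.
    root_location = location(root)
    item_location = location(root[0])

print("root:", root_location)
print("item:", item_location)
```

Comparing the line printed for item against lines 2 to 5 of the sample shows which end of the start tag your build reports.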

You can find the initial xincluded elements and check this with:

xels = doc.xpath('//*[@xml:base]')
for x in xels:
    print(x.tag, x.base, x.sourceline)
    for c in x:
        print(c.tag, c.base, c.sourceline)

Bryan A. Jones · Answer 2 · 2021-10-18T15:41:31.140

Sadly, current versions of lxml no longer set xml:base on xincluded elements, so elt.base reports the top-level document instead of the included file. However, I've developed a workaround using a simple custom loader. Here's a test script which demonstrates the bug in the approach above along with the workaround. Note that this approach only updates the xml:base attribute of the root tag of the included document.

The output of the program (using Python 3.9.1, lxml 4.6.3):

Included file was source.xml; xinclude reports it as document.xml
Included file was source.xml; workaround reports it as source.xml

Here's the sample program.

# Includes
# ========
from pathlib import Path
from textwrap import dedent
from lxml import etree as ElementTree
from lxml import ElementInclude


# Setup
# =====
# Create a sample document, taken from the `Python stdlib 
# <https://docs.python.org/3/library/xml.etree.elementtree.html#id3>`_...
Path("document.xml").write_text(
    dedent(
        """\
        <?xml version="1.0"?>
        <document xmlns:xi="http://www.w3.org/2001/XInclude">
            <xi:include href="source.xml" parse="xml" />
        </document>
        """
    )
)

# ...and the associated include file.
Path("source.xml").write_text("<para>This is a paragraph.</para>")


# Failing xinclude case
# =====================
# Load and xinclude this.
tree = ElementTree.parse("document.xml")
tree.xinclude()

# Show that the ``base`` attribute refers to the top-level 
# ``document.xml``, instead of the xincluded ``source.xml``.
root = tree.getroot()
print(f"Included file was source.xml; xinclude reports it as {root[0].base}")


# Workaround
# ==========
# As a workaround, define a loader which sets the ``xml:base`` of an
# xincluded element. While lxml evidently used to do this, a change
# eliminated this ability per some `discussion 
# <https://mail.gnome.org/archives/xml/2014-April/msg00015.html>`_, 
# which included a rejected patch fixing this problem. `Current source 
# <https://github.com/GNOME/libxml2/blob/master/xinclude.c#L1689>`_ 
# lacks this patch.
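def my_loader(href, parse, encoding=None, parser=None):
    ret = ElementInclude._lxml_default_loader(href, parse, encoding, parser)
    # Only XML includes return an element; text includes return a string,
    # which has no attributes to set.
    if parse == "xml":
        ret.attrib["{http://www.w3.org/XML/1998/namespace}base"] = href
    return ret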


new_tree = ElementTree.parse("document.xml")
ElementInclude.include(new_tree, loader=my_loader)

new_root = new_tree.getroot()
print(f"Included file was source.xml; workaround reports it as {new_root[0].base}")
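Since only the root tag of the included document carries xml:base, one way to find the originating file for a nested element is to walk up its ancestors until an xml:base attribute turns up. The following is a minimal sketch of that idea, not part of lxml: tagging_loader and nearest_xml_base are hypothetical names, the loader is a variation on the one above that sticks to public lxml calls, and the sample files are written to a temporary directory.

```python
import tempfile
from pathlib import Path

from lxml import ElementInclude, etree

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def tagging_loader(href, parse, encoding=None):
    """Hypothetical loader: load an include and tag XML roots with xml:base."""
    if parse == "xml":
        included = etree.parse(href).getroot()
        included.set(XML_BASE, href)
        return included
    # Text includes come back as plain strings and cannot be tagged.
    return Path(href).read_text(encoding=encoding or "utf-8")


def nearest_xml_base(elt):
    """Return the closest xml:base on elt or its ancestors, else None."""
    node = elt
    while node is not None:
        value = node.get(XML_BASE)
        if value is not None:
            return value
        node = node.getparent()
    return None


with tempfile.TemporaryDirectory() as tmp:
    document = Path(tmp) / "document.xml"
    document.write_text(
        '<document xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="source.xml" parse="xml" />'
        "</document>"
    )
    (Path(tmp) / "source.xml").write_text("<para><b>Bold</b> text.</para>")

    tree = etree.parse(str(document))
    ElementInclude.include(tree, loader=tagging_loader)

new_root = tree.getroot()
para = new_root[0]   # root of the included file
bold = para[0]       # nested element inside the included file
para_base = nearest_xml_base(para)
bold_base = nearest_xml_base(bold)
print(para.tag, para_base)
print(bold.tag, bold_base)
```

Elements that were never part of an include have no xml:base above them, so the helper returns None for those and you can fall back to the top-level document name.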
