XPath Child Traversal Methods and Performance

Question

I'm using lxml on Python 2.7.

Given a node, node, and a child element, child_element, what is the difference between these two queries?

node.xpath('./child_element')

node.xpath("*[local-name()='child_element']")

In other words, what's going on under the hood here? Is there any reason one ought to be "better" than another (in terms of performance or correctness)?

I've read through the lxml docs and a good deal of other XPath query resources and am not finding any real clarification.

PascalVKooten · Accepted Answer · 2015-07-17T19:36:23.043

It's a good question, and the answer is not easy to find.

The main difference is that local-name() ignores the namespace (and therefore the prefix) of a tag.

For example, given a node <x:html xmlns:x="http://www.w3.org/1999/xhtml"/>, a local-name() test will match the html tag, while //html will match nothing and //x:html will raise an error unless the x prefix is mapped to a namespace.

Consider the following code. If you have any questions, feel free to ask.

Show me the code

Setup:

from lxml.etree import fromstring
tree = fromstring('<x:html xmlns:x="http://www.w3.org/1999/xhtml"/>')

A plain tag selector no longer works on this tree:

tree.xpath('//html')
# []

tree.xpath('//x:html')
# XPathEvalError: Undefined namespace prefix

But using local-name() we can still get the element (ignoring the namespace):

tree.xpath('//*[local-name() = "html"]')
# [<Element {http://www.w3.org/1999/xhtml}html at 0x103b8d848>]

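Or match on the prefixed name exactly as it is written in the document, using name() (this compares the prefix, not the namespace URI):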

tree.xpath('//*[name() = "x:html"]')
# [<Element {http://www.w3.org/1999/xhtml}html at 0x103b8d848>]
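The XPathEvalError above goes away once the prefix is mapped to a namespace URI through the namespaces argument of xpath(). A minimal sketch, assuming the same one-element tree as in the setup; the prefix h is an arbitrary choice made for this example, since only the URI has to match the document:

```python
from lxml.etree import fromstring

tree = fromstring('<x:html xmlns:x="http://www.w3.org/1999/xhtml"/>')

# Map a prefix of our choosing to the namespace URI. The prefix used in
# the query does not have to be the one used in the document ("x").
ns = {'h': 'http://www.w3.org/1999/xhtml'}

by_prefix = tree.xpath('//h:html', namespaces=ns)
by_local_name = tree.xpath('//*[local-name() = "html"]')

# Both queries select the single root element.
print([el.tag for el in by_prefix])
print([el.tag for el in by_local_name])
```

This is the namespace-aware counterpart of the local-name() query, and it is the form timed in the Performance section.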

Performance

I parsed this website as a tree and used the following queries:

%timeit tree.xpath('//*[local-name() = "div"]')
# 1000 loops, best of 3: 570 µs per loop

%timeit tree.xpath('//div')
# 10000 loops, best of 3: 44.4 µs per loop

Now onto actual namespaces. I parsed a block from here.

example = """ ... """
from lxml.etree import fromstring
tree = fromstring(example)

%timeit tree.xpath('//hr:author', 
                   namespaces = {'hr' : 'http://eric.van-der-vlist.com/ns/person'})
# 100000 loops, best of 3: 18.2 µs per loop

%timeit tree.xpath('//*[local-name() = "author"]')
# 10000 loops, best of 3: 37.7 µs per loop
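To repeat this kind of comparison outside IPython, something like the following sketch with the standard timeit module should work. The document here is a synthetic stand-in, not the block I parsed above, and the namespace URI and element count are made up for illustration, so the absolute numbers will differ from the ones shown:

```python
import timeit
from lxml.etree import fromstring

# Synthetic stand-in document (NOT the elided block from the answer):
# 1000 namespaced <p:author> elements under one root.
NS = 'http://example.com/ns/person'  # hypothetical namespace URI
xml = '<p:people xmlns:p="%s">%s</p:people>' % (NS, '<p:author/>' * 1000)
tree = fromstring(xml)

def with_namespace():
    # Namespace-qualified name test; the prefix is mapped via namespaces=.
    return tree.xpath('//hr:author', namespaces={'hr': NS})

def with_local_name():
    # Wildcard step filtered by a string comparison on the local name.
    return tree.xpath('//*[local-name() = "author"]')

# Sanity check: both queries must select the same elements before
# their timings are worth comparing.
assert with_namespace() == with_local_name()

runs = 200
for func in (with_namespace, with_local_name):
    seconds = timeit.timeit(func, number=runs)
    print('%s: %.1f us per call' % (func.__name__, seconds / runs * 1e6))
```

Run it against your own document rather than the synthetic one to see whether the difference matters for your case.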

Conclusion

I had to rewrite the conclusion: after testing with the namespaces argument, it became clear that specifying the namespace is also faster. In the test above it is roughly 2 times faster than local-name() (18.2 µs vs. 37.7 µs per loop), presumably because the name test can be optimized while the local-name() predicate cannot.
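On the correctness side of the question: because local-name() ignores namespaces, it can also match more than intended. A small sketch with a made-up document (the root element and the two title children are hypothetical) comparing the child-axis forms from the question:

```python
from lxml.etree import fromstring

# Hypothetical document: two <title> children in different namespaces.
XHTML = 'http://www.w3.org/1999/xhtml'
SVG = 'http://www.w3.org/2000/svg'
node = fromstring(
    '<root xmlns:h="%s" xmlns:s="%s">'
    '<h:title/><s:title/>'
    '</root>' % (XHTML, SVG)
)

# Unprefixed name test: only matches children in no namespace, so nothing here.
plain = node.xpath('./title')

# Namespace-qualified name test: only the XHTML title matches.
qualified = node.xpath('./h:title', namespaces={'h': XHTML})

# local-name() test: every child called "title" matches, whatever its namespace.
by_local_name = node.xpath("*[local-name() = 'title']")

print(len(plain), len(qualified), len(by_local_name))
print([el.tag for el in by_local_name])
```

So the two forms are only interchangeable when the local name is unique across the namespaces in your document; otherwise the namespace-qualified query is the stricter one.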

Thank you. I think this answer is nearly complete, but it really begs the question of whether there's a difference if you take into account the namespaces argument. So, is there a performance or correctness difference under the hood between `node.xpath('./child_element', namespaces=something)` and `node.xpath("*[local-name()='child_element']")`? I tried to read the source code to find out but don't understand Cython very well. — AutomaticStatic, Jul 16 '15 at 23:00
I mostly believe in tests by simulation when it gets complicated... If you have an example it will be easy to run a benchmark. It will most likely also best show you if it will be worth it anyway. — PascalVKooten, Jul 16 '15 at 23:11
The comparison with the namespaces arg would be `tree.xpath('//x:html', namespaces={'x': 'http://www.w3.org/1999/xhtml'})` compared to the local name version. — AutomaticStatic, Jul 17 '15 at 00:33
@AutomaticStatic There you go, I think that's the quantification you were looking for :)? — PascalVKooten, Jul 17 '15 at 19:36
