Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations

Question

I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:

from lxml import etree
for e, tag in etree.iterparse(source, tag='Foo'):
    print tag.xpath('bar/baz')[42] # there's actually a function call here

The problem is, some of the documents have a namespace declaration, and some don't have any. That means that in the code above both tag='Foo' and xpath parts won't work.

For now I've been putting up with the ugly

for e, tag in etree.iterparse(source):
    if tag.tag.endswith('Foo'):
        print tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42]

but this is so awful that I want to get it right even though it works fine. (I guess it should be slower, too.)

Is there a way to write sane code that would account for both cases using iterparse? For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.

P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.

Do elements passed in at start events include namespaces? Could you check those for the nsxml attributes? — Martijn Pieters, Sep 08 '12 at 17:01
LXML has the unfortunate limitation that it only supports XPath 1.0. In XPath 2.0, you could say `*:bar/*:baz`. Your check for `local-name()` is the [usual XPath 1.0 workaround](http://developer.marklogic.com/blog/namespace-wildcards-in-xpath). — Fred Foo, Sep 08 '12 at 17:03
@MartijnPieters Namespace is either declared in the root or not declared at all. But I think you're right that I can check `tag.nsmap` on `end` events (did you mean `end`? this is what is caught by default). This leaves the `tag='Foo'` part unsolved, but seems easier overall. This may well be the best answer for the question. — Lev Levitsky, Sep 08 '12 at 17:09
@LevLevitsky: I meant start, you'd have to ask for start events too; but you are right, end will do. Namespaces *can* be declared on any element, btw. — Martijn Pieters, Sep 08 '12 at 17:10
@larsmans Thanks for the link, I'm a bit relieved. Still, having it 12 times in a file makes me sad. — Lev Levitsky, Sep 08 '12 at 17:14
@LevLevitsky: I'm afraid it's just the way it is. When parsing XML in Python (with LXML, or with the more space-efficient built-in ElementTree), I tend to build the path expressions dynamically after inspecting the root node for its namespace. — Fred Foo, Sep 08 '12 at 17:17
@MartijnPieters and larsmans: Guys, your comments have been of great help; too bad they're not answers so I can't upvote. — Lev Levitsky, Sep 08 '12 at 17:22

score 2 · Accepted Answer · answered Sep 08 '12 at 17:24

2

All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.

answered Sep 08 '12 at 17:24

Martijn Pieters

1,048,767
296
4,058
3,343

Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations

1 Answers1

Linked