3

I have a directory full of XML files (~10³ to 10⁴ of them) from which I need to extract the contents of several fields. I've tested different XML parsers, and since I don't need to validate the contents (which is expensive), I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files one by one and extract the data.

  1. Is there a more efficient way? (simple text matching doesn't work)
  2. Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
  3. Any caveats?

Thanks!

  • Can you give a little more information about the files? Are they identical? Do all of them contain the needed information? Why is text matching useless? An example or two would help as well. – muhuk Dec 05 '08 at 20:04
  • What other parsers did you try? For a very similar purpose, I tested `xml.dom.ext.reader` and Python bindings of libxml2 and libxml2 was much faster. – bortzmeyer Dec 08 '08 at 12:28
  • @muhuk: text matching is useless because of XML-specific things: for instance, searching for "foo" with plain text matching won't find an entity-escaped spelling such as `&#102;oo`, even though it is the same thing in XML. – bortzmeyer Dec 08 '08 at 12:29

4 Answers

4

Usually, I would suggest using ElementTree's iterparse, or, for extra speed, its counterpart from lxml. Also try to use the processing module (built into Python 2.6 as multiprocessing) to parallelize.

The important thing about iterparse is that you get the element (sub-)structures as they are parsed.

import xml.etree.cElementTree as ET

xml_it = ET.iterparse("some.xml")
event, elem = xml_it.next()  # pull the first (event, element) pair

event will always be the string "end" in this case, but you can also initialize the parser to report elements as soon as they are opened. You have no guarantee that all child elements will have been parsed at that point, but the attributes are already there, if that is all you are interested in.
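For instance, a minimal sketch (the file name is a placeholder) that asks for both kinds of events:

import xml.etree.cElementTree as ET

# By default iterparse only reports "end" events; ask for "start" too.
for event, elem in ET.iterparse("some.xml", events=("start", "end")):
    if event == "start":
        # Tag and attributes are available here; children may not be yet.
        print elem.tag, elem.attrib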

Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.

If the files are large (are they?), there is a common idiom for keeping memory usage constant, just as in a streaming parser.
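The usual idiom (the tag and field names here are hypothetical) is to clear each element once you have pulled out what you need, so the in-memory tree never grows beyond one record:

import xml.etree.cElementTree as ET

for event, elem in ET.iterparse("some.xml"):
    if elem.tag == "record":          # hypothetical element of interest
        print elem.findtext("field")  # hypothetical child field
        elem.clear()                  # free the subtree we just processed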

Torsten Marek
3

The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML - depending on your XMLs this could actually work.

But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set. It will take roughly the same amount of time, and will give you real numbers to drive you forward.
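A rough sketch of such a timing harness (the extractor functions are whatever candidates you implemented):

import time

def benchmark(parse_func, filenames):
    # Time parse_func over a small sample of files.
    start = time.time()
    for name in filenames:
        parse_func(name)
    return time.time() - start

# e.g.: print benchmark(parse_with_expat, sample), benchmark(parse_with_iterparse, sample)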

EDIT:

  • Are the files on a local drive or network drive? Network I/O will kill you here.
  • The problem parallelizes trivially - you can split the work among several computers (or several processes on a multicore computer), as sketched below.
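A minimal multiprocessing sketch (the glob pattern and the extracted field path are hypothetical):

import glob
import xml.etree.cElementTree as ET
from multiprocessing import Pool

def extract(filename):
    # Pull one hypothetical field out of a single file.
    tree = ET.parse(filename)
    return filename, tree.findtext("some/field")

if __name__ == "__main__":
    pool = Pool()  # one worker process per core by default
    results = pool.map(extract, glob.glob("data/*.xml"))
    print len(results)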
orip
  • Hi, I had thought of that and that's how I picked expat (it was the fastest I found). I clarified the question a bit to reflect this. I'm wondering if there is anything I'm missing, or if there are any tricks that I can use that will speed things up. – bgoncalves Dec 05 '08 at 18:41
  • Regular expressions will work on only a very small subset of XML documents, and are full of hidden defects unless you really, really know what you're doing (e.g. you know how you're going to handle all permutations of encoding and whitespace). – Robert Rossney Dec 06 '08 at 00:46
  • @Robert - or unless your XMLs happen to be generated in a way that makes it easy. That's the difference between accepting any XML matching the schema, and a bunch of XMLs that happen to all be generated the same. – orip Dec 06 '08 at 09:03
1

If you know that the XML files are generated using the same algorithm every time, it might be more efficient to not do any XML parsing at all. E.g. if you know that the data is in lines 3, 4, and 5, you might read the file line by line and then use regular expressions.
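A minimal sketch of that idea (the line number, tag, and pattern are hypothetical):

import re

# Hypothetical: the field we want always sits on line 4 and looks like
#   <title>Some title</title>
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def extract_title(filename):
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            if lineno == 4:
                m = TITLE_RE.search(line)
                return m.group(1) if m else None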

Of course, that approach would fail if the files are not machine-generated, or originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.

Whether or not you recycle the parser objects is largely irrelevant: many other objects are created during parsing, so a single parser object doesn't really count for much.

Martin v. Löwis
1

One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
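For example, a minimal SAX handler (the tag name "field" is hypothetical) that collects the text of every occurrence of one element:

import xml.sax

class FieldExtractor(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_field = False
        self.values = []

    def startElement(self, name, attrs):
        if name == "field":
            self.in_field = True
            self.values.append("")

    def characters(self, content):
        if self.in_field:
            self.values[-1] += content  # text may arrive in several chunks

    def endElement(self, name):
        if name == "field":
            self.in_field = False

handler = FieldExtractor()
xml.sax.parse("some.xml", handler)
print handler.values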

Robert Rossney