4

I have a set of tools which index large XML files (MediaWiki dump files) and use those indices for random access to the individual records stored in the file. It works very well, but I'm "parsing" the XML with string functions and/or regular expressions rather than a real XML parser, which is a fragile solution should the way the files are created change in the future.

Do some or most XML parsers have ways to do this kind of indexed random access?

(I have versions of my tools written in C, Perl, and Python. Parsing the entire files into some kind of database or mapping them into memory are not options.)
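
Roughly, the approach looks like this minimal sketch (illustrative regexes and names, not my actual code) - and the string-matching it relies on is exactly the fragile part:

```python
import re

# Build a byte-offset index: scan the dump once, recording where each
# <page> record starts. The regex "parsing" here is illustrative and
# fragile, which is the problem described above.
def build_index(path):
    index = {}
    title_re = re.compile(rb'<title>(.*?)</title>')
    offset = 0
    pending = None  # offset of the <page> we are currently inside
    with open(path, 'rb') as f:
        for line in f:
            if b'<page>' in line:
                pending = offset
            m = title_re.search(line)
            if m and pending is not None:
                index[m.group(1).decode('utf-8')] = pending
                pending = None
            offset += len(line)
    return index

# Random access: seek straight to the recorded offset and read one record.
def fetch(path, index, title):
    with open(path, 'rb') as f:
        f.seek(index[title])
        record = []
        for line in f:
            record.append(line)
            if b'</page>' in line:
                break
        return b''.join(record)
```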

UPDATE

Here are rough statistics for comparison: the files I am using are published roughly weekly; the current one is 1,918,212,991 bytes. The C version of my indexing tool takes a few minutes on my netbook and only has to be run once for each new XML file published. Less often I use the same tools on another XML file whose current size is 30,565,654,976 bytes and which was updated only 8 times in 2010.

hippietrail

5 Answers

1

VTD-XML looks to be the first serious attempt at addressing this problem:

The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

(VTD-XML even has its very own tag here on Stack Overflow, so you can follow questions about it.)

hippietrail
  • I wonder why in one year nobody has commented on this answer. Is this use case just so uncommon? Has this worked out for you @hippietrail? – fho Feb 27 '14 at 11:33
  • @Florian: I never actually tried it. There was no implementation/library/glue in the programming language I was playing with at the time, and the format of the Wikipedia XML dump files still hasn't changed in a way that breaks my old simplistic method. But I agree with you that it seems odd nobody here on SO seems to mention such use cases ... – hippietrail Feb 27 '14 at 12:48
  • Maybe it's just a matter of choosing the right tool for the right job. XML just isn't very good at random access *and* big files at the same time. On the other hand, if I just have a big dump of XML I don't have much choice if I'm asked to provide random access into those files. Buying more RAM and using DOM was considered, but in the end we create veeeery large files on several computers and buying more memory just delays the problem. – fho Feb 27 '14 at 13:18
1

I think you should store this data in an XML database such as eXist-db, rather than creating your own tools to do a very small subset of what an XML database gives you.

Michael Kay
  • Can you give some reasons? I don't need to do more than the very small subset of things. I'm going to read up on eXist-db, but how would it compare for speed? Obviously it would at least mean doubling the required storage space. – hippietrail May 05 '11 at 14:29
1

If you're using Python, try lxml - it's very fast and flexible, and it compares quite well with regexes for speed. Much faster than the alternatives, in any language, without compromise.

Use iterparse to step through the wikipedia articles.

Note that this does not give you random access to the articles in your dump (which is a perfectly reasonable request!) - but iterparse will give you a fast and easy-to-use 'forward-only' cursor... and lxml might be the right tool for parsing chunks you've fseek()'d to by other means.
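
Here's a minimal iterparse sketch (the MediaWiki export namespace URI below is an assumption - check the xmlns attribute on your dump's root element - and clearing elements as you go is what keeps memory flat):

```python
from lxml import etree

# Stream through a MediaWiki dump, touching one <page> at a time.
# NOTE: the namespace URI below is an assumption; check the xmlns
# attribute on the <mediawiki> root element of your own dump.
NS = '{http://www.mediawiki.org/xml/export-0.5/}'

for event, elem in etree.iterparse('dump.xml', tag=NS + 'page'):
    title = elem.findtext(NS + 'title')
    print(title)
    # Free the element (and its already-processed preceding siblings)
    # so memory use stays constant regardless of file size.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
```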

Here's the best documentation I've found for it:

http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/index.html

(try the pdf version)

lxml itself is a third-party package; the standard library's xml.etree.ElementTree offers a similar iterparse if you can't install lxml.

Mike McCabe
  • Hmm, could be useful for creating the index if the regex breaks at some point, but not for the actual random access, as you point out. – hippietrail Aug 16 '11 at 09:18
0

XML is a structured format. As such, random access does not really make much sense - you must know where you are going.

Regular expressions also need the whole string to be loaded into memory. This is still better than DOM, since DOM usually takes 3-4 times more memory than the size of the XML file.

The typical solution for these cases is SAX, which has a really small memory footprint but acts like a forward-only cursor: you are not accessing randomly; you have to traverse the document to get where you need. If you are using .NET, you can use XmlTextReader.
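
As an illustration in Python (since the question mentions it), here is a minimal forward-only SAX traversal - the 'title' element name is just an example:

```python
import xml.sax

# Forward-only SAX traversal: the handler is called as elements stream
# by, so memory use stays constant regardless of file size.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False

    def startElement(self, name, attrs):
        if name == 'title':  # illustrative element name
            self.in_title = True

    def characters(self, content):
        if self.in_title:
            print(content)

    def endElement(self, name):
        if name == 'title':
            self.in_title = False

xml.sax.parse('dump.xml', TitleHandler())
```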

Indexes are also useful if the XML does not update often, since creating such an index can be expensive.

Aliostad
  • If it really wouldn't make much sense, then why would the W3C put so much effort into defining random access on binary representations of XML? See (among others) http://www.w3.org/TR/xbc-properties/#random-access – Abel Mar 10 '12 at 14:17
-1

XPath is far better than string/regex "parsing", but XPath normally works on an XML document that has been parsed into an in-memory DOM first; if your documents are really large, you might run into memory problems.
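
To illustrate the memory cost, here is the usual lxml/XPath workflow (the file name and XPath expression are made up): the whole tree is built before any query runs.

```python
from lxml import etree

# etree.parse() builds the full in-memory tree before any XPath runs,
# so peak memory scales with document size.
tree = etree.parse('dump.xml')
titles = tree.xpath('//title/text()')  # illustrative XPath expression
print(len(titles))
```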

Karl-Bjørnar Øie