Suppose I have a very large XML file whose entries carry `<id>` elements or `id=""` attributes.

How can I search by this id? Can I create a search index or something similar?

Currently I am using org.w3c.dom. Does it offer any means of searching?

UPDATE

My big XML file is a downloaded Wikipedia dump. It is 40 GB in size and has millions of records.

Is it possible to index it with something like Lucene and then look up IDs quickly?

UPDATE2

I have tried BaseX. It ingested my XML and created a 32 GB database. I have not worked out whether it truncated the data or whether the 32 GB is the result of compression.

Unfortunately, searching by ID takes 70-80 seconds or longer, which is slower than a MediaWiki API query.

Suzan Cioc
  • If you can use DOM on your XML, how big is it? Usually it's 10x bigger as a DOM object in memory. If you have a DOM you can build a `Map` – Peter Lawrey Feb 03 '13 at 09:38
  • I have not started to work with big XML yet. I am using DOM with small XMLs. Big one is a downloaded Wikipedia, it has millions of pages and 40G size. I need to index it once and then use an index. – Suzan Cioc Feb 03 '13 at 09:45
  • In that case you need to parse all the documents and store the ids and where they can be found in a Map or database or both as you prefer. I would use a SAX parser as it's likely to be more efficient. – Peter Lawrey Feb 03 '13 at 09:46
  • See http://stackoverflow.com/questions/11210600 – Mark O'Connor Feb 03 '13 at 09:48
  • This means converting XML to database. Then it would be better to download not an XML but a database. I would like to find a way to work with XML first. Aren't there some means to index XMLs??? – Suzan Cioc Feb 03 '13 at 09:49
  • Take a look here: http://stackoverflow.com/questions/4842813/use-xml-as-database-in-java – Sergiy Medvynskyy Feb 03 '13 at 09:51

1 Answer

To read or write an XML file, you first need to parse it. There are several kinds of parser; the major ones are DOM, SAX, and StAX.

I wouldn't recommend a DOM parser, especially for a large XML file, because it reads the whole document into memory before you can query it, which is extremely inefficient when the file is really large. SAX and StAX parsers are streaming alternatives to DOM: they process the document as it is read instead of building a tree, so memory use stays small. Have a read about the StAX parser in Java here:

StAX parser tutorial

I think a StAX parser is the most suitable choice for reading a large XML file.
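
As a rough illustration of the StAX approach, here is a minimal sketch that streams the file once and builds an in-memory map from page id to title. It assumes a Wikipedia-style layout in which each `<page>` has `<id>` and `<title>` as direct children (nested ids such as a revision's `<id>` are skipped). The class name `StaxIdIndex` and the method `indexTitlesById` are made up for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: not part of any library.
class StaxIdIndex {

    /** Streams the XML once and maps each page's direct <id> child to its <title>. */
    static Map<String, String> indexTitlesById(Reader xml) throws XMLStreamException {
        Map<String, String> index = new HashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int depthInPage = -1; // -1: outside <page>; 0: directly inside <page>
        String id = null;
        String title = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (depthInPage < 0) {
                    if (name.equals("page")) {
                        depthInPage = 0;
                        id = null;
                        title = null;
                    }
                } else if (depthInPage == 0 && name.equals("id")) {
                    id = reader.getElementText(); // also consumes the closing </id>
                } else if (depthInPage == 0 && name.equals("title")) {
                    title = reader.getElementText();
                } else {
                    depthInPage++; // some other element nested inside the page
                }
            } else if (event == XMLStreamConstants.END_ELEMENT && depthInPage >= 0) {
                if (depthInPage == 0) { // this is the closing </page>
                    if (id != null) {
                        index.put(id, title);
                    }
                    depthInPage = -1;
                } else {
                    depthInPage--;
                }
            }
        }
        reader.close();
        return index;
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample = "<mediawiki><page><title>Anarchism</title><id>12</id>"
                + "<revision><id>999</id></revision></page></mediawiki>";
        System.out.println(indexTitlesById(new StringReader(sample)).get("12"));
    }
}
```

For a 40 GB dump with millions of pages, the map would probably have to live in a database or some other on-disk store instead of a `HashMap`, but the single streaming pass stays the same.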

FYI, here is a link to a SAX parser tutorial too:

SAX parser tutorial in Java
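
For comparison, a minimal SAX sketch is shown below. It collects every id it meets in one pass, whether it appears as an `<id>` element or as an `id="..."` attribute, which are the two forms mentioned in the question. The class name `SaxIdCollector` and the method `collectIds` are invented for illustration, and the element and attribute names are assumptions about the file's layout.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers ids from <id> elements and id="" attributes.
class SaxIdCollector extends DefaultHandler {
    private final Set<String> ids = new LinkedHashSet<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inId;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        String attrId = attrs.getValue("id");
        if (attrId != null) {
            ids.add(attrId); // id given as an attribute
        }
        if (qName.equals("id")) {
            inId = true;
            text.setLength(0); // start collecting the element's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inId) {
            text.append(ch, start, length); // text may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inId && qName.equals("id")) {
            ids.add(text.toString().trim()); // id given as an element
            inId = false;
        }
    }

    /** Parses the source once and returns the ids in document order. */
    static Set<String> collectIds(InputSource source) throws Exception {
        SaxIdCollector handler = new SaxIdCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.ids;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<root><entry id=\"a1\"><id>7</id></entry></root>";
        System.out.println(collectIds(new InputSource(new StringReader(sample))));
    }
}
```

SAX pushes events to callbacks, whereas StAX lets the calling code pull the next event, so the choice between them is largely a matter of which control flow suits the program.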

Jason