
I have an archive of several years' worth of XML documents. There are 1M+ unique document subjects, and each subject may have one or more documents for any given year. Each document contains hundreds of nodes and parameters. Total XML cache is about 50GB in size.

I need to build a system that stores and indexes these documents and allows searches that filter on various parameters (and whose scope can be expanded over time).

To achieve this I certainly have to use some sort of indexed DBMS. I considered building a tool to import the XML files into a relational database like MySQL, but this seems like a brittle and overly complicated solution.

I've heard ElasticSearch and MongoDB mentioned as possible solutions, but I'm not familiar enough with their feature sets to determine if either one is the optimal solution.

What is the best practice, optimal solution for storing, indexing, and searching an XML dataset at this scope?

MarathonStudios
  • Best practice is to use an XML database (examples include MarkLogic, eXistDB, and BaseX), but advice on choosing a particular product is outside the scope of Stack Overflow. – Michael Kay Jul 03 '16 at 22:36
  • Is there a reason to choose an XML database over ElasticSearch? – MarathonStudios Jul 04 '16 at 04:28
  • I don't know ElasticSearch, but there are big benefits in having a database that retains the logical structure of your data without transformation and that uses that structure as the fundamental data model underlying its query language. – Michael Kay Jul 04 '16 at 07:47

2 Answers


Both Elasticsearch and MongoDB can be considered NoSQL ("Not only SQL") databases, which allow large amounts of data to be handled efficiently.

In CAP-theorem terms, MongoDB gives priority to consistency and partition tolerance, while Elasticsearch favors availability and partition tolerance. You have to decide which trade-off suits your needs best.

If you are looking for a secondary store to query against, Elasticsearch is a good choice: it is fast and every request will get a response, but it is only eventually consistent. If the response needs to be accurate at all times, you will prefer MongoDB, which gives priority to consistency.
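A minimal sketch of that difference, assuming local MongoDB and Elasticsearch instances, the official pymongo and elasticsearch (8.x) Python clients, and made-up index/field names:

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch

# MongoDB: an acknowledged write is immediately visible to subsequent reads.
mongo = MongoClient("mongodb://localhost:27017")
docs = mongo.archive.documents
docs.insert_one({"subject": "S-001", "year": 2016})
print(docs.find_one({"subject": "S-001"}))  # found right away

# Elasticsearch: a newly indexed document only becomes searchable after the
# next index refresh (near-real-time search, i.e. eventual consistency).
es = Elasticsearch("http://localhost:9200")
es.index(index="documents", id="S-001", document={"subject": "S-001", "year": 2016})
es.indices.refresh(index="documents")  # force a refresh so the demo is deterministic
print(es.search(index="documents", query={"match": {"subject": "S-001"}}))
```

Without the explicit refresh, the document typically becomes searchable only after the default one-second refresh interval, which is the "eventually consistent" behaviour described above.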

sanurah

1. I would store the XML files on the file system.
2. I would write an XML parser and store each attribute in MongoDB with proper indexes.
3. From MongoDB, I would index the required attributes in Elasticsearch with an appropriate tokenizer.

Remember, MongoDB is meant to store data; you can implement search on it, but the performance is not as good. Elasticsearch, as the name suggests, is built for search.
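A rough sketch of that pipeline in Python, assuming the pymongo and elasticsearch (8.x) clients and a hypothetical XML layout with subject/year attributes and <param> elements; your real document structure, field names, and paths will differ:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

from pymongo import MongoClient
from elasticsearch import Elasticsearch

XML_DIR = Path("/data/xml-archive")  # hypothetical location of the 50 GB archive

mongo_docs = MongoClient("mongodb://localhost:27017").archive.documents
es = Elasticsearch("http://localhost:9200")

# Step 2: index the attributes you expect to filter on in MongoDB.
mongo_docs.create_index("subject")
mongo_docs.create_index("year")

for path in XML_DIR.rglob("*.xml"):
    root = ET.parse(path).getroot()

    # Flatten the parts of the document you care about and keep a pointer
    # back to the original file, so the XML itself stays on disk (step 1).
    record = {
        "subject": root.get("subject"),
        "year": int(root.get("year", 0)),
        "params": {p.get("name"): p.text for p in root.iter("param")},
        "source_file": str(path),
    }
    mongo_docs.insert_one(record)

    # Step 3: push only the searchable attributes to Elasticsearch; the index
    # mapping (including any custom tokenizer/analyzer) is defined up front.
    es.index(index="documents", document={
        "subject": record["subject"],
        "year": record["year"],
        "params": record["params"],
        "source_file": record["source_file"],
    })
```

In practice you would create the Elasticsearch index mapping before the first load and use the bulk helpers (`elasticsearch.helpers.bulk`) rather than per-document calls when ingesting an archive of this size.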

Hope this answers your question.

Prabhat Kumar