Choosing the right NoSQL storage for highly connected and flexible domain

Question

We're starting a new project and looking for an appropriate storage solution for our case. Main requirements for the storage are as follows:

Ability to support a highly flexible and connected domain
Ability to support queries like "give all children of that item and the items linked to those children" in milliseconds
Full text search
Ad hoc analytics
Solid read and write performance
Scalability (as we want to offer a SaaS version of our product)
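
To make the second requirement concrete, here is a minimal Python sketch of the query shape, using plain in-memory maps rather than any of the databases discussed. The names `children`, `links`, and `children_with_links` are hypothetical and only illustrate what "children of an item plus the items linked to those children" means.

```python
from collections import defaultdict

# Hypothetical in-memory model (an assumption for illustration only):
# parent -> list of children, and item -> list of linked items.
children = defaultdict(list)
links = defaultdict(list)

def add_child(parent, child):
    children[parent].append(child)

def add_link(source, target):
    links[source].append(target)

def children_with_links(item):
    """Return {child: [items linked to that child]} for one parent item."""
    return {child: list(links[child]) for child in children[item]}

add_child("project", "task-1")
add_child("project", "task-2")
add_link("task-1", "doc-A")
add_link("task-1", "doc-B")

result = children_with_links("project")
print(result)
```

Whatever storage is chosen has to answer this two-hop lookup quickly at scale; the sketch only pins down the expected result shape.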

First of all, we eliminated all RDBMSs, since we have a really flexible schema that can also be changed by the customer (adding new fields etc.), so supporting such a solution in any RDBMS could become a nightmare... So we came to NoSQL. We evaluated several NoSQL storage engines and chose the 3 most appropriate (as we think).

MongoDB

Pros:

Appropriate to store aggregates with flexible structure (as we have them)
Scalability/Maturity/Support/Community
Experience with MongoDB on previous project
Drivers, cloud support
Analytics
Price (it's free)

Cons:

No support for relationships (really important for us as we have a lot of connected items)
Slow retrieval of connected data (all joins happen in app)
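
As an illustration of what "all joins happen in app" means, here is a minimal Python sketch that emulates a collection with a list of dicts. The field name `linked_ids` and the helpers `find_one`, `find_in`, and `load_with_links` are hypothetical; against a real document store each helper would be a separate round trip to the server.

```python
# Hypothetical collection emulated as a list of dicts (not a real driver).
items = [
    {"_id": 1, "name": "order", "linked_ids": [2, 3]},
    {"_id": 2, "name": "invoice", "linked_ids": []},
    {"_id": 3, "name": "shipment", "linked_ids": []},
]

def find_one(collection, item_id):
    # First round trip: fetch the item itself.
    return next((doc for doc in collection if doc["_id"] == item_id), None)

def find_in(collection, ids):
    # Second round trip: fetch every document whose _id is in the given ids.
    wanted = set(ids)
    return [doc for doc in collection if doc["_id"] in wanted]

def load_with_links(collection, item_id):
    """Application-side join: resolve the references after loading the item."""
    item = find_one(collection, item_id)
    if item is None:
        return None
    linked = find_in(collection, item["linked_ids"])
    return {**item, "linked": linked}

loaded = load_with_links(items, 1)
print([doc["name"] for doc in loaded["linked"]])
```

Each additional hop through the links adds another query of this kind, which is where the concern about slow retrieval of connected data comes from.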

Neo4j:

Pros:

Support for connected data in modeling, flexibility
Fast retrieval of interconnected data
Drivers, cloud support
Maturity/Support/Community (if we compare with other graph DBs)

Cons:

No support for aggregate storage (we would like to have each aggregate in one vertex rather than in several)
Scalability (as far as I know, all data is currently duplicated on the other servers)
Analytics ?
Write performance ? (I have read several blogs where customers complained about its write performance)
Price (it is not free for commercial software)

OrientDB

Pros:

It seems that OrientDB has all the features that we need (aggregates and graphdb in one solution)
Price (looks like it's free)

Cons:

Immaturity (compared with the others)
Really small company behind the technology (in particular, one main contributor), so there are questions about support, known issues etc.
A lot of features, but do they all work well?

So now, the main dilemma for us is between Neo4j and OrientDB (MongoDB is a third option because of its lack of relationships, which are really important in our case - this post explains the pitfalls). I've searched for benchmarks/comparisons of these DBs, but all of them are old. Here is a comparison by features: http://vschart.com/compare/neo4j/vs/orientdb. So now we need advice from people who have already used these DBs on what to choose. Thanks in advance.

score 2 · Answer 1 · answered May 22 '14 at 12:16

I think there are interesting trade-offs with each of these:

MongoDB doesn't do graphs;
Neo4j's nodes are flat key-value properties;
OrientDB forces you to choose between graphs and documents (can't do both simultaneously).

So your choice is between a graph store (Neo4j or OrientDB) and a document store (MongoDB or OrientDB). My sense is that MongoDB is the leading document store and Neo4j is the leading graph database, which would lead me to pick one of these. But since connectivity is important, I'd lean towards the graph database and take Neo4j.

Neo4j's scalability is proven: it's in use for graphs larger than Facebook's and by enormous companies like Walmart and eBay. So if your problem is anywhere between 0-120% of Facebook's social graph, Neo4j has you covered. Write throughput is fine with Neo4j - I get in excess of 2,000 proper ACID transactions per second on a laptop and I can easily queue writes to multiply that out.
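
As a rough illustration of what queueing writes could look like, here is a generic Python sketch that groups individual writes into batches so that each batch can be committed in one transaction. It does not use any Neo4j API; `write_in_batches` and the `commit_batch` callback are hypothetical names, and the batch size is an arbitrary assumption.

```python
def write_in_batches(writes, commit_batch, batch_size=100):
    """Group writes into batches and hand each batch to commit_batch.

    commit_batch is a hypothetical callback that would apply one batch
    inside a single transaction. Returns the number of batches committed.
    """
    batch = []
    committed = 0
    for write in writes:
        batch.append(write)
        if len(batch) >= batch_size:
            commit_batch(batch)
            committed += 1
            batch = []  # start a fresh list so committed batches stay intact
    if batch:
        # Flush the final, partially filled batch.
        commit_batch(batch)
        committed += 1
    return committed

log = []
batches = write_in_batches(range(250), log.append, batch_size=100)
print(batches, [len(b) for b in log])
```

The idea is that the per-transaction overhead is paid once per batch instead of once per write; how much that helps depends on the database and workload.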

Everything else is pretty equal: you can choose to pay for any of these or use them freely under their open source licenses (including Neo4j if you can work with GPL/AGPL). Neo4j's paid licenses have great support (up 24x7x365, 1 hour turnaround worldwide) versus OrientDB's rather lacklustre support (4 hour turnaround in the EU daytime only), and I imagine MongoDB has good support too (though I have not checked up on it).

In short, there's a reason Neo4j is the top database for connected data: it kicks ass!

Jim

Thank you Jim, good points. "OrientDB forces you to choose between graphs and documents" - this sounds strange because the creator of OrientDB says the opposite: http://stackoverflow.com/questions/20825656/orientdb-as-a-document-graph-database — Voice, May 22 '14 at 14:11

score 2 · Answer 2 · answered May 22 '14 at 14:53

To correct some misconceptions regarding MongoDB:

Relations are supported, by either linking to other documents or embedding them. Please see the Data Modeling Introduction in the MongoDB docs for details. It may be that you are forced to trade normalization against speed, though. However, there are use cases in which embedding is the better solution compared to relations. Think of orders: when you embed the order items and their prices, you do not need a price history table for each and every product ever sold.
What is not supported are JOINs, which you can circumvent by embedding documents, as mentioned above.
MongoDB can be used for tree structures. Please see Model Tree Structures with Materialized Paths for details. This approach seems to be the most appropriate way to implement a tree structure for the mentioned use case. An alternative may be an array of ancestors, depending on your needs.
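
A minimal Python sketch of the two tree models mentioned above, with documents emulated as plain dicts rather than stored in MongoDB. The field names `ancestors` and `path` and the sample data are assumptions made for illustration; the comments note the kind of query each lookup stands in for.

```python
import re

# Hypothetical documents: each stores an array of ancestors and a
# materialized path made of the same ancestors joined by commas.
docs = [
    {"_id": "root", "ancestors": [], "path": None},
    {"_id": "a", "ancestors": ["root"], "path": ",root,"},
    {"_id": "b", "ancestors": ["root", "a"], "path": ",root,a,"},
    {"_id": "c", "ancestors": ["root"], "path": ",root,"},
]

def descendants_by_ancestors(collection, node_id):
    # Stands in for an equality match on the array field,
    # e.g. a filter like {"ancestors": node_id}.
    return [d["_id"] for d in collection if node_id in d["ancestors"]]

def descendants_by_path(collection, node_id):
    # Stands in for a regular-expression match on the path field.
    pattern = re.compile("," + re.escape(node_id) + ",")
    return [d["_id"] for d in collection if pattern.search(d["path"] or "")]

print(descendants_by_ancestors(docs, "root"))
print(descendants_by_path(docs, "a"))
```

Both lookups return the whole subtree in a single query, at the cost of rewriting `ancestors` or `path` on every descendant when a node is moved.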

That being said, MongoDB may fail one of the basic requirements, though this really depends on how you define it: ad hoc analysis. My suggestion would be to model the intended data structure using a document-oriented approach (as opposed to forcing a relational approach onto a document-oriented database) and prototype one of the possible analysis use cases with dummy data.

If you have an embedded document, it becomes part of an aggregate. But in our case there are a lot of linked aggregates which the system should traverse pretty fast — Voice, May 22 '14 at 16:16
Granted. What I wanted to point out is that sometimes smart schema modeling eliminates problems. What sometimes happens is that developers already have a relational schema in mind and try to adapt it "so it works with MongoDB". In my experience, rethinking the schema eliminates the "problems" most of the time. In the case of heavily linked aggregates, this might very well be the case. When talking of traversing, I would assume that we have a tree structure one way or the other. An index on an array of ancestors should be extremely fast for reads. — Markus W Mahlberg, May 22 '14 at 16:34
