
I am working on a project where we are batch loading and storing huge volumes of data in an Oracle database, which is constantly queried via Hibernate against a 100+ million record table (reads are much more frequent than writes). To speed things up we are using Lucene for some of the queries (especially geo bounding box queries) and the Hibernate second-level cache, but that's still not enough. We still have a bottleneck in the Hibernate queries against Oracle (we don't cache the 100+ million table entities in the Hibernate second-level cache because we don't have that much memory).

What additional NoSQL solutions (apart from Lucene) can I leverage in this situation?

Some options I am thinking of are:

  1. Use distributed Ehcache (Terracotta) for the Hibernate second-level cache to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).

  2. Switch entirely to an in-memory SQL database like H2, but unfortunately those solutions require loading the 100+ million row table into a single VM.

  3. Use Lucene for querying and a BigTable implementation (or a distributed hash map) for entity lookup by id (a sketch of the id lookup follows below this list). Which BigTable implementation would be suitable for this? I was considering HBase.

  4. Use MongoDB for storing the data and for both querying and lookup by id.
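
For option 3, here is a minimal sketch of what the lookup-by-id side could look like with the HBase client API. The table name "entities", column family "d" and qualifier "json" are placeholders, not anything from an existing schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EntityLookup {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for cluster settings.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("entities"))) {
                // Row key = entity id returned by the Lucene query.
                Get get = new Get(Bytes.toBytes("12345"));
                Result result = table.get(get);
                byte[] payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("json"));
                System.out.println(payload == null ? "not found" : Bytes.toString(payload));
            }
        }
    }

The idea is that Lucene only returns ids, and each id becomes a single row-key Get against HBase.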

– tsolakp

6 Answers


I recommend Cassandra with ElasticSearch for a scalable system (100 million records is nothing for them). Use Cassandra for all your data and ES for the ad hoc and geo queries. Then you can kill your entire legacy stack. You may need an MQ system like RabbitMQ for data sync between Cassandra and ES.
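
A minimal sketch of that write path, assuming the DataStax Java driver and the RabbitMQ Java client; the keyspace, table and queue names are placeholders:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import java.nio.charset.StandardCharsets;

    public class WriteAndSync {
        public static void main(String[] args) throws Exception {
            String id = "12345";
            String json = "{\"id\":\"12345\",\"lat\":40.7,\"lon\":-74.0}";

            // 1. Cassandra is the system of record.
            try (CqlSession session = CqlSession.builder().build()) { // contact points from config, defaults to localhost
                session.execute("INSERT INTO geo.entities (id, payload) VALUES (?, ?)", id, json);
            }

            // 2. Publish the id so a separate indexer can (re)index the document into ES.
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            try (Connection conn = factory.newConnection();
                 Channel channel = conn.createChannel()) {
                channel.queueDeclare("es-sync", true, false, false, null);
                channel.basicPublish("", "es-sync", null, id.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

A consumer reading the "es-sync" queue is what keeps ES eventually consistent with Cassandra.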

– sirmak

It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. You therefore need to be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system where all the work is still being done on one node (albeit with better queries, depending on the case).
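
As a toy illustration of that point (node count and keys are purely hypothetical): a record's owning node is derived from nothing but a hash of its key, which is exactly why a poorly chosen key skews all the work onto one node.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Every record is routed by a hash of its key; a key most records share
    // a prefix of (e.g. a timestamp bucket) can still hash unevenly in practice.
    public final class KeyPartitioner {
        private final int nodeCount;

        public KeyPartitioner(int nodeCount) {
            this.nodeCount = nodeCount;
        }

        public int nodeFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % nodeCount);
        }

        public static void main(String[] args) {
            KeyPartitioner p = new KeyPartitioner(4);
            System.out.println(p.nodeFor("entity:12345")); // node index 0..3
        }
    }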

You also need to think back to the CAP theorem: most NoSQL databases are eventually consistent (CP or AP), while traditional relational DBMSs are CA. This will impact the way you handle data and how you create certain things; for example, key generation can become tricky.
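
A minimal illustration of one common workaround: generate keys client-side (e.g. UUIDs) so that no node has to coordinate with a central sequence.

    import java.util.UUID;

    // Decentralized key generation: any node can mint an id independently,
    // with no database sequence or coordination step.
    public final class IdGenerator {
        public static String newId() {
            return UUID.randomUUID().toString();
        }

        public static void main(String[] args) {
            System.out.println(newId()); // e.g. "3f1c9a2e-...."
        }
    }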

Also remember that in some systems, such as HBase, there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, and there is also the possibility of integrating Solr with Mongo. You don't just need to query by ID in Mongo as you do in HBase, which is a column family database (aka a Google BigTable style database) where you essentially have nested key-value pairs.
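
A minimal sketch of that with the MongoDB Java driver: a secondary index plus a geo bounding-box query, the kind of query the question currently routes through Lucene. The database, collection and field names are placeholders:

    import com.mongodb.client.FindIterable;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class MongoGeoQuery {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll =
                    client.getDatabase("geo").getCollection("entities");

                // Secondary indexes are declared on the collection itself.
                coll.createIndex(Indexes.geo2d("location"));     // 2d index over [lon, lat] pairs
                coll.createIndex(Indexes.ascending("ownerId")); // plain field index

                // Bounding-box query served by the 2d index.
                FindIterable<Document> hits = coll.find(
                    Filters.geoWithinBox("location", -74.1, 40.6, -73.7, 40.9));
                hits.forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }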

So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it, and so on. We don't just use one system but many, each best suited to the job at hand. For this process we use different systems at different stages, as this gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, lets us keep track of everything as we go (data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing Oracle on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.

Another path that no one has mentioned so far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.

Again, it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really have to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.

My advice would be to actually play around with a few different systems. Look at:

MongoDB - Document - CP

CouchDB - Document - AP

Redis - In-memory key-value store (not column family) - CP

Cassandra - Column Family - Available & Partition Tolerant (AP)

HBase - Column Family - Consistent & Partition Tolerant (CP)

Hadoop/Hive

VoltDB - A really good-looking product: a relational database that is distributed and might work for your case (and may be an easier move). They also seem to provide enterprise support, which may be more suited to a prod env (i.e. it gives business users a sense of security).

Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.

– NightWolf

As you suggest, MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests on MongoDB with significantly larger data sets than the one you're describing and it works fine. Especially if you're read heavy, MongoDB's sharding and/or distributing reads across replica set members will allow you to speed up your queries significantly. If your use case allows for keeping your indexes right-balanced, your goal of getting close to 20 ms queries should become feasible without further caching.
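
A minimal sketch of distributing reads across replica set members with the MongoDB Java driver; the host names, replica set name and database are placeholders:

    import com.mongodb.ConnectionString;
    import com.mongodb.MongoClientSettings;
    import com.mongodb.ReadPreference;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;

    public class ReadScaling {
        public static void main(String[] args) {
            // Route reads to secondaries so the primary is not the only node serving queries.
            MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(
                    new ConnectionString("mongodb://host1,host2,host3/?replicaSet=rs0"))
                .readPreference(ReadPreference.secondaryPreferred())
                .build();
            try (MongoClient client = MongoClients.create(settings)) {
                long n = client.getDatabase("geo").getCollection("entities").countDocuments();
                System.out.println(n);
            }
        }
    }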

– Remon van Vliet

You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This gives them the speed of Solr indexing (sharding and replication), backed by a highly reliable data storage system.

– David

You could group requests and split them by specific sets of data, then have a single server (or a group of servers) handle each group. That way the relevant data is available in that server's cache, which improves performance.

e.g.,

Say employee and availability data are handled using 10 tables; these can be served by a small group of server(s) when you configure the Hibernate cache to load and handle those requests (a sketch follows below).

For this to work you need a load balancer (one that balances load by business scenario).

I'm not sure how much of this can be implemented here.
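
A minimal sketch of the Hibernate side of this, assuming a hypothetical Employee entity: only the hot tables served by that group of servers are placed in the second-level cache, while the 100+ million row table stays out of it.

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    // Cached in the second-level cache under its own region so the cache
    // for this server group holds only the tables it is responsible for.
    @Entity
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE, region = "employeeRegion")
    public class Employee {
        @Id
        private Long id;
        private String name;

        // getters/setters omitted
    }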

– Anantha Sharma

At 100M records your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse, and it handles them fine.

What kind of queries do you execute on your table?

– Olaf
  • Here is an example of the runtime of the same method modified to use an in-memory database vs. going all the way to Oracle: 116,201 ms vs 20 ms (the 116,201 ms is spent in oracle.jdbc.driver.OraclePreparedStatement.executeQuery() according to YourKit). My goal is to get as close as possible to 20 ms. – tsolakp Jun 23 '11 at 20:58
  • @Tsolak Petrosian: If your performance goal is tens of milliseconds for searches on a moderately large 100M record table, you should probably consider in-memory databases or caches rather than just NoSQL. – Olaf Jun 23 '11 at 21:24