Before I ask my question, a little background about our problem.

We are developing a government revenue projection application. In this application, we collect data about past revenue and apply several econometric and political projection models (in several phases), persisting the results to the database.

In short, this is a simplified model for the projection system:

Several rows (initially ~6,000) with the following schema represent past revenue; this data set is called a scenario:

+------+------+------+------+------+-------+---------+
| Cat1 | Cat2 | Cat3 | Cat4 | Year | Month | Revenue |
+------+------+------+------+------+-------+---------+

Throughout the projection system, the data is transformed in several ways (moving categories around, changing revenue values, correcting atypical values, etc.). Each transformation is done in a phase:

Initial Scenario (S0) ---(1st transformation phase)--> Transformed Scenario (S1)
S1 ---(2nd t.p.)--> S2 ---> S3 ---...---> SN

Each phase transforms a piece of the scenario, varying from 2% to 100% of the data, and the partial scenario states must be persisted until the final state (SN) is reached. A partial state can be either just the transformed rows or the entire partial scenario, as long as partial states can be compared with each other.

Also, the user can go back in the process (say, to phase 2) and restart the projection from there, discarding the work that was done after that point.
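To make the pipeline more concrete, here is a rough Python sketch of how we think of the phase chain; the phase bodies and names are purely illustrative, not our actual code:

    # Rough sketch of the phase chain (illustrative names only, not our real code).
    # Each phase takes a scenario (list of rows) and returns a transformed copy;
    # every intermediate state S0..SN is kept so the user can restart from any phase.

    from copy import deepcopy

    def phase_1(scenario):
        # e.g. move categories around on a subset of the rows
        return [dict(row, Cat1="ADJUSTED") if row["Year"] < 2010 else row
                for row in scenario]

    def phase_2(scenario):
        # e.g. apply a projection factor to the revenue values
        return [dict(row, Revenue=row["Revenue"] * 1.03) for row in scenario]

    PHASES = [phase_1, phase_2]  # ... up to phase N

    def run_projection(s0, restart_from=0, saved_states=None):
        """Run the phases, keeping every partial state; restart_from lets the
        user go back to an earlier phase and discard the later work."""
        states = (saved_states or [deepcopy(s0)])[:restart_from + 1]
        for phase in PHASES[restart_from:]:
            states.append(phase(states[-1]))  # persist S1, S2, ..., SN
        return states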

With this use case we ran into a problem with RDBMSs: they are really slow for write operations (a single scenario projection can take as long as half an hour).

After reading about NoSQL DBMSs, we narrowed it down to several options, but since I'm still on the learning curve, I'd like to ask: which is the best choice for this use case: VoltDB, Redis, Riak, Cassandra, MongoDB or HBase?

Thanks in advance.

Carlos Melo

1 Answer

I'll chip in with my solution from a MongoDB perspective. Maybe others can add their experience from other databases.

Mongo has a map/reduce framework that lets you run pretty complex queries against your data. The nice thing about its map/reduce support is that you can output the results to a collection (analogous to a table in an RDBMS) in a database.

So it would be:

  1. Run map/reduce job 1 => outputs to collection A.
  2. Run map/reduce job 2 against collection A => outputs to collection B.
  3. Run map/reduce job 3 against collection B => outputs to collection C.
  4. ...
  5. Profit!

Each step produces a persisted set of data, which lets you pick up where you left off. You would of course have to clean up after you're done, but it could work.
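For illustration, here is a minimal pymongo sketch of that chain. The collection and field names ("scenario_s0", "cat1", "revenue", etc.) are assumptions I made up for the example, and note that map/reduce output documents always have the shape {_id, value}:

    # Minimal sketch of chained map/reduce jobs with pymongo (names are illustrative).
    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["projections"]

    # Job 1: total revenue per (cat1, year), written to collection "scenario_s1".
    map_1 = Code("""
        function () { emit({cat1: this.cat1, year: this.year}, this.revenue); }
    """)
    reduce_1 = Code("""
        function (key, values) { return Array.sum(values); }
    """)
    db["scenario_s0"].map_reduce(map_1, reduce_1, out="scenario_s1")

    # Job 2: runs against the persisted output of job 1, producing "scenario_s2".
    # Map/reduce output stores each result under "value", keyed by "_id".
    map_2 = Code("""
        function () { emit(this._id.year, this.value); }
    """)
    reduce_2 = Code("""
        function (key, values) { return Array.sum(values); }
    """)
    db["scenario_s1"].map_reduce(map_2, reduce_2, out="scenario_s2")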

I'm not a Hadoop expert, but your scenario sounds like a good fit for its ecosystem: running complex jobs in stages with intermediate outputs.

ryan1234
  • We are seriously considering MongoDB for the job. It did perform well; we just want to see how well it scales when more nodes are added to the cluster. From your experience, do you think MongoDB can handle a medium-sized data set (somewhere between 500 GB and 1 TB)? – Carlos Melo Mar 15 '13 at 21:32
  • I believe it would handle that size, but the data at my company is far below that. A replica set can have 12 nodes (http://docs.mongodb.org/manual/core/replication/). After that you'll need to shard and divide amongst replica sets. (http://docs.mongodb.org/manual/core/sharded-clusters/). Read up on the working set in Mongo too - http://stackoverflow.com/questions/6453584/what-does-it-mean-to-fit-working-set-into-ram-for-mongodb. Mongo wants to load the entire database (with indexes) into memory. Might influence your decision. – ryan1234 Mar 16 '13 at 02:54
  • Awesome. Actually, loading the entire database in memory is a good thing for us since it is possible to shard the data. Thanks! – Carlos Melo Mar 18 '13 at 17:11