I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution best fits my needs. In theory, if I had unlimited CPUs, my results would return instantaneously, so any help would be appreciated.
Here's what I have:
- 1000s of datasets
- dataset keys:
    - all datasets have the same keys
    - 1 million keys (this may later be 10 or 20 million)
- dataset columns:
    - each dataset has the same columns
    - 10 to 20 columns
    - most columns hold numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
    - a few columns are "type_id" columns; a particular query may need to include only certain type_ids
- web application
    - the user can choose which datasets they are interested in (anywhere from 15 to 1000)
    - the application needs to present: key, and the aggregated results (avg, stddev) of each column (a toy sketch of this computation follows these lists)
- updates of data:
    - an entire dataset can be added, dropped, or replaced/updated
    - would be cool to be able to add columns, but if required I can just replace the entire dataset
    - rows/keys are never added to a dataset, so I don't need a system with lots of fast writes
- infrastructure:
    - currently two machines with 24 cores each
    - eventually, want the ability to also run this on Amazon
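To make the target computation concrete, here's a toy sketch in Python of what each request has to produce. The data layout (one dict per dataset, mapping key -> (type_id, column values)) and all names here are illustrative assumptions, not my real schema:

```python
# A toy sketch of the per-request computation; the dict-based layout and
# all names are illustrative assumptions, not the real schema.
from statistics import mean, pstdev

def aggregate(datasets, wanted_type_ids):
    """Pool the selected datasets and return, per key, one (avg, stddev)
    pair for each numeric column."""
    pooled = {}  # key -> one list of values per column
    for ds in datasets:
        for key, (type_id, values) in ds.items():
            if type_id not in wanted_type_ids:
                continue
            cols = pooled.setdefault(key, [[] for _ in values])
            for col, v in zip(cols, values):
                col.append(v)
    return {key: [(mean(col), pstdev(col)) for col in cols]
            for key, cols in pooled.items()}

# example: k2's second row is filtered out by type_id
ds1 = {"k1": (7, [1.0, 2.0]), "k2": (8, [3.0, 4.0])}
ds2 = {"k1": (7, [5.0, 6.0]), "k2": (9, [7.0, 8.0])}
print(aggregate([ds1, ds2], {7, 8}))
```

With ~1M keys, that per-key loop is the part I'd like to spread across cores/machines, since no key depends on any other.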
I can't precompute my aggregated values, because the user picks an arbitrary subset of datasets at query time; but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since I can easily add/drop/replace them
- the database is nice for filtering based on type_id (a sketch of the per-request query follows this list)
- it isn't easy to write parallel queries against the database
- databases are good for structured data, and my data is not structured
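For reference, each request today boils down to roughly this query shape. A minimal sketch, assuming a parent table `measurements` partitioned per dataset on a hypothetical `dataset_id` column (all table/column names are illustrative):

```python
# A minimal sketch of the per-request aggregate query; "measurements",
# "dataset_id", and "col1" are illustrative names, not the real schema.
import psycopg2

QUERY = """
    SELECT key,
           avg(col1)    AS col1_avg,
           stddev(col1) AS col1_stddev
    FROM measurements
    WHERE dataset_id = ANY(%(datasets)s)
      AND type_id    = ANY(%(type_ids)s)
    GROUP BY key
"""

def fetch_aggregates(conn, dataset_ids, type_ids):
    # psycopg2 adapts Python lists to SQL arrays, so ANY(...) applies the
    # user's dataset and type_id selections without building SQL strings.
    with conn.cursor() as cur:
        cur.execute(QUERY, {"datasets": list(dataset_ids),
                            "type_ids": list(type_ids)})
        return cur.fetchall()
```

Partition pruning keeps the scan limited to the selected datasets, but the GROUP BY itself still runs serially in a single backend, which is exactly the bottleneck I'm trying to get rid of.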
As a proof of concept, I tried out hadoop:
- created a tab-separated file per dataset for a particular type_id
- uploaded the files to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation (mapper/reducer sketched below)
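The proof of concept was essentially the following Hadoop Streaming pair (a minimal sketch; the column index and field layout are assumptions, not my exact code):

```python
#!/usr/bin/env python
# mapper.py - emit key<TAB>value for one numeric column; the column
# index is a hypothetical stand-in for the real layout.
import sys

VALUE_COL = 3  # assumed index of the column being aggregated

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    print("%s\t%s" % (fields[0], fields[VALUE_COL]))
```

```python
#!/usr/bin/env python
# reducer.py - streaming sorts mapper output by key, so all values for
# one key arrive together; compute avg and (population) stddev per key.
import sys
import math

def flush(key, values):
    n = len(values)
    avg = sum(values) / n
    var = sum((v - avg) ** 2 for v in values) / n
    print("%s\t%f\t%f" % (key, avg, math.sqrt(var)))

current_key, values = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        flush(current_key, values)
        values = []
    current_key = key
    values.append(float(value))

if current_key is not None:
    flush(current_key, values)
```

Run with the streaming jar (its path depends on the Hadoop distribution), something like:

```
hadoop jar /path/to/hadoop-streaming.jar \
    -input /data/datasets -output /data/aggregates \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
```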
From my crude proof of concept, I can see that this will scale nicely, but I can also see that hadoop/hdfs has high latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestions on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
Thanks!