I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution best fits my needs. In theory, if I had unlimited CPUs, my results would return instantaneously, so any help would be appreciated.
Here's what I have:
- 1000s of datasets
- dataset keys:
    - all datasets have the same keys
    - 1 million keys (this may later be 10 or 20 million)
- dataset columns:
    - each dataset has the same columns
    - 10 to 20 columns
    - most columns hold numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
    - a few columns are "type_id" columns; a particular query may need to include only certain type_ids
- web application
    - the user can choose which datasets they are interested in (anywhere from 15 to 1000)
    - the application needs to present: key, and the aggregated results (avg, stddev) of each column (a toy sketch of this computation follows these lists)
- updates of data:
    - an entire dataset can be added, dropped, or replaced/updated
    - would be cool to be able to add columns, but if required I can just replace the entire dataset
    - rows/keys are never added to a dataset, so I don't need a system with lots of fast writes
- infrastructure:
    - currently two machines with 24 cores each
    - eventually, want the ability to also run this on Amazon
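To make the target computation concrete, here's a toy sketch in Python of what each request has to produce. The data layout (one dict per dataset, mapping key -> (type_id, column values)) and all names here are illustrative assumptions, not my real schema:

```python
# A toy sketch of the per-request computation; the dict-based layout and
# all names are illustrative assumptions, not the real schema.
from statistics import mean, pstdev

def aggregate(datasets, wanted_type_ids):
    """Pool the selected datasets and return, per key, one (avg, stddev)
    pair for each numeric column."""
    pooled = {}  # key -> one list of values per column
    for ds in datasets:
        for key, (type_id, values) in ds.items():
            if type_id not in wanted_type_ids:
                continue
            cols = pooled.setdefault(key, [[] for _ in values])
            for col, v in zip(cols, values):
                col.append(v)
    return {key: [(mean(col), pstdev(col)) for col in cols]
            for key, cols in pooled.items()}

# example: k2's second row is filtered out by type_id
ds1 = {"k1": (7, [1.0, 2.0]), "k2": (8, [3.0, 4.0])}
ds2 = {"k1": (7, [5.0, 6.0]), "k2": (9, [7.0, 8.0])}
print(aggregate([ds1, ds2], {7, 8}))
```

With ~1M keys, that per-key loop is the part I'd like to spread across cores/machines, since no key depends on any other.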
I can't precompute my aggregated values, because the user picks an arbitrary subset of datasets at query time; but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since I can easily add/drop/replace them
- the database is nice for filtering based on type_id (a sketch of the per-request query follows this list)
- it isn't easy to write parallel queries against the database
- databases are good for structured data, and my data is not structured
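For reference, each request today boils down to roughly this query shape. A minimal sketch, assuming a parent table `measurements` partitioned per dataset on a hypothetical `dataset_id` column (all table/column names are illustrative):

```python
# A minimal sketch of the per-request aggregate query; "measurements",
# "dataset_id", and "col1" are illustrative names, not the real schema.
import psycopg2

QUERY = """
    SELECT key,
           avg(col1)    AS col1_avg,
           stddev(col1) AS col1_stddev
    FROM measurements
    WHERE dataset_id = ANY(%(datasets)s)
      AND type_id    = ANY(%(type_ids)s)
    GROUP BY key
"""

def fetch_aggregates(conn, dataset_ids, type_ids):
    # psycopg2 adapts Python lists to SQL arrays, so ANY(...) applies the
    # user's dataset and type_id selections without building SQL strings.
    with conn.cursor() as cur:
        cur.execute(QUERY, {"datasets": list(dataset_ids),
                            "type_ids": list(type_ids)})
        return cur.fetchall()
```

Partition pruning keeps the scan limited to the selected datasets, but the GROUP BY itself still runs serially in a single backend, which is exactly the bottleneck I'm trying to get rid of.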
As a proof of concept, I tried out hadoop:
- created a tab-separated file per dataset for a particular type_id
- uploaded the files to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation (mapper/reducer sketched below)
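The proof of concept was essentially the following Hadoop Streaming pair (a minimal sketch; the column index and field layout are assumptions, not my exact code):

```python
#!/usr/bin/env python
# mapper.py - emit key<TAB>value for one numeric column; the column
# index is a hypothetical stand-in for the real layout.
import sys

VALUE_COL = 3  # assumed index of the column being aggregated

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    print("%s\t%s" % (fields[0], fields[VALUE_COL]))
```

```python
#!/usr/bin/env python
# reducer.py - streaming sorts mapper output by key, so all values for
# one key arrive together; compute avg and (population) stddev per key.
import sys
import math

def flush(key, values):
    n = len(values)
    avg = sum(values) / n
    var = sum((v - avg) ** 2 for v in values) / n
    print("%s\t%f\t%f" % (key, avg, math.sqrt(var)))

current_key, values = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        flush(current_key, values)
        values = []
    current_key = key
    values.append(float(value))

if current_key is not None:
    flush(current_key, values)
```

Run with the streaming jar (its path depends on the Hadoop distribution), something like:

```
hadoop jar /path/to/hadoop-streaming.jar \
    -input /data/datasets -output /data/aggregates \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py
```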
From my crude proof of concept, I can see that this will scale nicely, but I can also see that hadoop/hdfs has high latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestions on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
Thanks!