Multidimensional analysis in Hive/Impala

Question

I have a denormalized table say Sales that looks like:

SalesKey, SalesOfParts, SalesOfEquipments, CostOfSales as some numeric measures Industry, Country, State, Sales area, Equipment id, customer id, year of sale, month of sale and some more similar dimensions. (Total of 12 dimensions)

I need to support aggregation queries on the Sales, like total number of sales in a year, month... total cost of them etc. Also these aggregates need to be filtered, i.e. something like total sales in year 2013, 04 belonging to Manufacturing industry of XYZ customer.

I have these dimension tables and facts in hive/impala.

I do not think I can make a cube on all the dimensions. I read a paper to see how to do OLAP over multiple dimensions : http://www.vldb.org/conf/2004/RS14P1.PDF

Which basically suggests to materialize cubes over small fragments and do some kind of runtime computation when query spans multiple cubes.

I am not sure how to implement this model in Hive/Impala. Any pointers/suggestions will be awesome.

EDIT: I have about 10 million rows in the Sales table, and the dimensions are not comparable to 100, but are around 12 ( might go upto 15) but have a good cardinality each.

Perhaps it is worth mentioning the sizing of your data and the required response time of your queries. The referenced paper is about a scenario with more than 100 dimensions and quite a small amount of 10e6 tuples. Does it apply to your usecase? — Marc Polizzi, Aug 01 '14 at 05:12
Added the sizing of data... I don't have response requirements yet. — Dhruv Kapur, Aug 01 '14 at 16:16

score 1 · Answer 1 · answered Aug 02 '14 at 05:09

1

I would build cubes using a 3rd-party software. For example, icCube is an in-memory OLAP server that can handle with no issue at all 10mio of rows over 12 dimensions. Then the response time would be sub-second in all dimensions. Moving out from Hive 10mio of rows does not seem to be an issue (you could use the JDBC driver for that purpose). icCube is specifically designed to handle properly high sparsity.

answered Aug 02 '14 at 05:09

Marc Polizzi

9,275
3
36
61

Ok. Thank you. I'll look into icCube. I'm interested in how does this cube computation happen? Aren't 12 columns of just 5 cardinality too huge to compute and store? Thank again. – Dhruv Kapur Aug 02 '14 at 06:29
All aggregations happens on the fly; icCube is storing in RAM the 10mio rows. 12 dimensions is fine. – Marc Polizzi Aug 02 '14 at 15:31
If its about storing in memory I think I already have a tool for that. Impala will do it too. Just clarifying then, there are no cubification that is happening? – Dhruv Kapur Aug 02 '14 at 17:29
Yes icCube is about making OLAP cubes; the aggregations are performed on the fly during the processing of the requests with sub-seconds response time. icCube is supporting standard MDX language. – Marc Polizzi Aug 03 '14 at 05:44

Multidimensional analysis in Hive/Impala

1 Answers1