
Suppose, hypothetically, that I have a star schema in a data warehouse setting. There is one VERY, VERY long fact table (think billions to trillions of rows) and several low-cardinality dimension tables (think 100 dimension tables). Each fact table foreign key that points to a dimension table primary key is bitmap indexed, and so is each dimension table primary key. This is all for fast joins. All pretty standard.

Suppose that the data warehouse is starting to show performance degradation. The time it takes to return results from a bitmapped join gets worse the longer the fact table gets. The business requirement is that the fact table keeps growing (we cannot move data older than a year off to archival storage).

I'm thinking of the following solutions:

  1. Hash partition the fact table, but this just holds off the inevitable growth issue temporarily.
  2. Partition the physical star schema across multiple schemas/databases: 1..N fact tables and their dimension copies, each holding the data assigned to it by a hash(1..N) function executed in a separate ETL staging database, which determines which database/schema each fact row (produced by the ETL process) goes into. If any dimension changes, replicate the change to the corresponding dimension in the other databases. Again, this will not work as a permanent solution.
  3. Collapse the dimensions and store all dimension values directly in the fact table, then import the fact table into HBase on Hadoop. You get one massive HBase table, a key-value store with no dimension tables. I would do this because joins are cost-prohibitive in HBase (so no fact-to-dimension joins; just enforce dimension values on the dimension columns). A rough sketch of such a row follows this list.
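
For option 3, a minimal sketch of what a collapsed, dimensionless fact row could look like, assuming the HBase Java client API; the table name, column families, and dimension/measure names here are invented for illustration. The dimension values ride along in the row itself (in the row key and as plain columns), so there is nothing to join:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch of option 3: the fact table collapsed into one HBase
// table, with the dimension values stored directly in each row (no joins).
// Table, family, and column names are illustrative only.
public class DenormalizedFactWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table factTable = connection.getTable(TableName.valueOf("sales_fact_denorm"))) {

            // Row key = the dimension path you drill down most often,
            // e.g. date, then store, then product, delimited so prefix
            // scans stay meaningful.
            String rowKey = "2012-09-27|store_0042|product_1234";
            Put put = new Put(Bytes.toBytes(rowKey));

            // Dimension attributes folded into the fact row ("d" family).
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("store_region"), Bytes.toBytes("EMEA"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("product_category"), Bytes.toBytes("widgets"));

            // Measures ("m" family).
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("units_sold"), Bytes.toBytes(17L));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("revenue"), Bytes.toBytes(423.50d));

            factTable.put(put);
        }
    }
}
```

Since HBase has no foreign keys, "enforcing dimension values on the dimension columns" would have to happen in the ETL step that builds these Puts.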

Has anyone ever done this before?

Does anyone have any tips for solution #3?

Is the HBase solution optimal as far as scaling up while keeping reads fast?

As for writes, I don't care about write speed, since writes would be done off-hours as batch processes.

If anyone has chosen solution 1 or 2, have you used a consistent hashing algorithm (to avoid the full remapping a plain old hash requires when partitions/hash keys are added dynamically)? Dynamic growth in the number of partitions without a full remap is probably not an option (I haven't seen it done in practice for partitioned tables), so it seems to me that any partitioning solution would eventually lead to scaling problems.
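
On the consistent hashing point, a rough sketch (hypothetical Java, not tied to any particular database's partitioning feature) of a hash ring with virtual nodes: adding a shard only remaps the keys that fall on its arcs, instead of the near-total remap you get from hash(key) % N when N changes:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: a consistent-hash ring with virtual nodes. Adding or
// removing a shard only remaps the keys on that shard's arcs, unlike
// hash(key) % N, which remaps almost everything when N changes.
public class ConsistentHashRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRouter(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addShard(String shardName) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(shardName + "#" + i), shardName);
        }
    }

    public void removeShard(String shardName) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(shardName + "#" + i));
        }
    }

    /** Route a fact row's business key to the first shard clockwise on the ring. */
    public String shardFor(String factKey) {
        long h = hash(factKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the MD5 digest as the ring position.
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The ETL staging step from option 2 could call shardFor(naturalKey) to pick the target database/schema, though the caveat above still applies: RDBMS partitioned tables generally won't grow their partition count this way without a rebuild.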

Any thoughts, advice, or experience with moving a giant fact table with many dimensions (a traditional DW star schema) to a giant dimensionless HBase table?

Related question:

How do aggregate data collections that traditionally reside in materialized views (or alternatively as separate fact tables linked to the same dimensions as the most granular fact table, i.e. hourly/daily/weekly/monthly where the base fact table is hourly) in a data warehouse map across to HBase?

[Figure: aggregate fact tables partial star schema]

My thoughts are that, since there are no materialized views in HBase, the aggregate data collections would be stored as HBase tables that get updated/inserted into whenever the most granular, lowest-level fact table changes.

Any thoughts on aggregate tables in HBase? Has anyone used Hive scripts to essentially mimic the behavior of materialized views, updating aggregate column data in secondary HBase tables that store the aggregate data (i.e. daily_aggregates_fact_table, weekly_aggregates_fact_table, monthly_aggregates_fact_table) whenever the most granular fact table changes?
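
For what it's worth, one way to approximate a materialized-view refresh is a Hive script run after each batch load that rebuilds only the partition touched by that load. A hypothetical sketch via the HiveServer2 JDBC driver; the host, user, table, and column names are assumptions, not from any real system (daily_aggregates_fact_table is just the name borrowed from above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch: after the nightly batch load into the granular fact
// table, rebuild the affected partition of a daily aggregate table with
// HiveQL, mimicking a materialized-view refresh. Host, table, and column
// names are illustrative assumptions.
public class DailyAggregateRefresh {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

            String loadDate = args.length > 0 ? args[0] : "2012-09-27";

            // Recompute only the partition touched by the latest load;
            // partition pruning keeps the scan to that day's files.
            stmt.execute(
                "INSERT OVERWRITE TABLE daily_aggregates_fact_table " +
                "PARTITION (load_date = '" + loadDate + "') " +
                "SELECT store_key, product_key, SUM(units_sold), SUM(revenue) " +
                "FROM hourly_fact_table " +
                "WHERE load_date = '" + loadDate + "' " +
                "GROUP BY store_key, product_key");
        }
    }
}
```

The weekly and monthly tables could be refreshed the same way, either from the granular table or by rolling up the daily aggregates.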

  • Why are you considering HBase and not Hive? – David Gruzman Sep 27 '12 at 07:21
  • What do you want to do with the HBase table? How are you going to access it? What sort of questions will you want to answer? There's no "general way" to do it, since HBase does not provide a general SQL-like mechanism to query it. – Arnon Rotem-Gal-Oz Sep 27 '12 at 10:47
  • Check out SO – there is a bunch of advice on setting up this kind of database as a partitioned Hive table. You'll be thankful for its (nearly) SQL capability. – Olaf Sep 27 '12 at 14:29
  • @David Gruzman, I can't do straight Hive (tables) because of the incremental refresh problem. I'm actually looking at accessing HBase tables with Hive; see http://www.cloudera.com/blog/2010/06/integrating-hive-and-hbase/ –  Sep 27 '12 at 17:04
  • @Arnon Rotem-Gal-Oz I'm going to use the HBase table as I would a collapsed star schema. What is a collapsed star schema? All the dimension descriptions would be saved in the fact table, doing away with all bitmap indexes and relational constraints from the fact table (FK) to the dimension tables (PK). I'm going to access the HBase table via Hive; there is a way to query HBase tables with SQL via Hive, as detailed at http://www.cloudera.com/blog/2010/06/integrating-hive-and-hbase/ –  Sep 27 '12 at 17:14
  • That's all fine, but note that HiveQL will result in map/reduce jobs, not instant results. – Arnon Rotem-Gal-Oz Sep 27 '12 at 17:25
  • How frequently is your fact data updated? Do updates happen all over the place, or are they concentrated in the data for the few latest days? In the cases that are more typical for data warehouse applications, Hive table partitioning can go a long way. – Olaf Sep 27 '12 at 17:27
  • @Arnon Rotem-Gal-Oz I'm looking at building aggregate tables via HiveQL -> map-reduce. But I also need fast results when looking for individual rows/entries in my base (most granular/detailed) HBase table. What should I do to get fast results? –  Sep 27 '12 at 18:04
  • @Olaf Yep, you got it spot on. Updates are concentrated in the data for the few latest days. I will partition by a time dimension determined by my aggregation needs. This way, with partition pruning, some queries (that build aggregate sets) may only need to read a subset of the files. That's a really good point, Olaf! The only thing is that time-series partitioning is a Hive table property; I don't think it exists for an HBase table. Am I wrong? –  Sep 27 '12 at 18:21
  • I think for the Hadoop-based solution you will need both: HBase for key-value access and Hive for heavy aggregations. You can use Hive or MR to aggregate HBase, but it will be less efficient. – David Gruzman Sep 27 '12 at 18:38
  • 100 dimensions!?! Is there no way to collapse some of the dimensions, even if you have to cross join a few? I would not call a fact table with 100 dimensions a "traditional" star schema ;-) – N West Oct 02 '12 at 17:38
  • @N West, it's just a hypothetical scenario. I wanted to give a sense that the particular case (fact table and dimensions) would be outside of a "traditional" star schema scenario. Most traditional star schemas have nowhere near that number of dimensions (think less than 10, 5 would be just right) because you would not want to drive your report designers crazy (think of the boiling frog story). –  Oct 08 '12 at 01:55

2 Answers


The dimensions would be encoded in the row key in HBase, and the value is your measure value. If your fact table is factless, the value in the HBase row can be null.

Based on the limited resources I could find on the Internet, I think the idea is:

**RowKey**                              **Value**
DimensionA                              XX
DimensionA:DimensionB                   XX
DimensionB:DimensionC                   XX
DimensionA:DimensionB:DimensionC        XXX
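
If a layout like that is used, a tiny, hypothetical helper (names are placeholders) can build the composite row keys so that each aggregation level shares a key prefix and can be read back with a single range/prefix scan:

```java
// Hypothetical helper for the layout above: each aggregation level becomes a
// delimited composite row key, so all rows for DimensionA, or for
// DimensionA:DimensionB, sit in one contiguous key range.
public class AggregateRowKeys {

    private static final String SEP = ":";

    public static String rowKey(String... dimensionValues) {
        return String.join(SEP, dimensionValues);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("storeA"));                        // level 1
        System.out.println(rowKey("storeA", "productB"));            // level 2
        System.out.println(rowKey("storeA", "productB", "2012-09")); // level 3
    }
}
```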

Is this suitable for your problem?

brucenan

HBase is not a good choice for a general-purpose data warehouse (with real-time-ish query times). Any single table will only allow you to drill down along one dimension, or along one path through dimensions (if you design the composite key right). It isn't undoable (e.g. eBay built their new search engine on HBase), but it isn't out of the box.
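
To illustrate the one-path point: with a row key ordered as date|store|product, a prefix scan fetches one date, or one date plus store, cheaply, but slicing by product alone means scanning the whole table. A hypothetical sketch against the HBase client Scan API (table, family, and column names are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch: with a composite row key of date|store|product, a
// prefix scan drills down along that one path quickly; any other dimension
// order needs a different table or a full scan.
public class DrillDownScan {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table factTable = connection.getTable(TableName.valueOf("sales_fact_denorm"))) {

            // All rows for one day and one store: cheap, because the key prefix matches.
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("2012-09-27|store_0042|"));

            try (ResultScanner scanner = factTable.getScanner(scan)) {
                for (Result row : scanner) {
                    long units = Bytes.toLong(
                            row.getValue(Bytes.toBytes("m"), Bytes.toBytes("units_sold")));
                    System.out.println(Bytes.toString(row.getRow()) + " -> " + units);
                }
            }
        }
    }
}
```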

There are several efforts to provide high-performance SQL over Hadoop (e.g. Hadapt or RainStor), but they won't give you the performance of good massively parallel databases like Vertica, Greenplum, Aster Data, Netezza and the like.

Arnon Rotem-Gal-Oz
  • The one-dimension query limitation means I have to build in, from the start, tables with composite keys based on the most frequent query conditions. This is a mind shift from the relational way, and most relational designers would not choose this path. Basically, you end up with many tables with redundant data, each having its own composite key. –  Sep 27 '12 at 19:19
  • A related thought is that the way to prioritize table creation in HBase is to take the most frequently queried composite keys that return different sets of resulting columns. Your WHERE-clause composite key sets are your HBase keys, and your SELECT-clause columns are the HBase columns accessed by that composite key. This turns the relational model on its head. You would basically design for functionality and frequency. –  Sep 27 '12 at 19:21