Our data warehouse builds a snowflake schema from a data source that delivers only accumulated data (there is no way to invert the accumulation). One requirement we have to meet is that the schema must support reports based on arbitrary date ranges.
Our schema looks like this (simplified):
+------------------------------------------+
| fact                                     |
+-------+-----------------+----------------+
| id    | statisticsDimId | dateRangeDimId |
+-------+-----------------+----------------+
| 1     | 1               | 10             |
| 2     | 2               | 11             |
| 3     | 3               | 12             |
| 4     | 4               | 13             |
| 5     | 5               | 14             |
| 6     | 5               | 15             |
| 7     | 5               | 16             |
| ...   | ...             | ...            |
| 10001 | 9908            | 11             |
| 10002 | 9909            | 11             |
+-------+-----------------+----------------+
+-------------------------------------------------+
| date_range_dimension                            |
+-------+-----------------------------------------+
| id    | startDateTime      | endDateTime        |
+-------+--------------------+--------------------+
| 10    | '2012-01-01 00:00' | '2012-01-01 23:59' |
| 11    | '2012-01-01 00:00' | '2012-01-02 23:59' |
| 12    | '2012-01-01 00:00' | '2012-01-03 23:59' |
| 13    | '2012-01-01 00:00' | '2012-01-04 23:59' |
| 14    | '2012-01-01 00:00' | '2012-01-05 23:59' |
| 15    | '2012-01-01 00:00' | '2012-01-06 23:59' |
| 16    | '2012-01-01 00:00' | '2012-01-07 23:59' |
| 17    | '2012-01-01 00:00' | '2012-01-08 23:59' |
| 18    | '2012-01-01 00:00' | '2012-01-09 23:59' |
| ...   | ...                | ...                |
+-------+--------------------+--------------------+
+-----------------------------------------------------+
| statistics_dimension                                |
+-------+-------------------+-------------------+-----+
| id    | accumulatedValue1 | accumulatedValue2 | ... |
+-------+-------------------+-------------------+-----+
| 1     | [not relevant]    | [not relevant]    | ... |
| 2     | [not relevant]    | [not relevant]    | ... |
| 3     | [not relevant]    | [not relevant]    | ... |
| 4     | [not relevant]    | [not relevant]    | ... |
| 5     | [not relevant]    | [not relevant]    | ... |
| 6     | [not relevant]    | [not relevant]    | ... |
| 7     | [not relevant]    | [not relevant]    | ... |
| ...   | [not relevant]    | [not relevant]    | ... |
| ...   | [not relevant]    | [not relevant]    | ... |
| 10001 | [not relevant]    | [not relevant]    | ... |
| 10002 | [not relevant]    | [not relevant]    | ... |
+-------+-------------------+-------------------+-----+
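In DDL terms, the simplified schema looks roughly like this (the concrete data types are placeholders, not our real ones):

-- Minimal sketch of the simplified schema; data types are assumptions.
CREATE TABLE date_range_dimension (
    id            BIGINT PRIMARY KEY,
    startDateTime DATETIME NOT NULL,
    endDateTime   DATETIME NOT NULL
);

CREATE TABLE statistics_dimension (
    id                BIGINT PRIMARY KEY,
    accumulatedValue1 DECIMAL(18, 4),
    accumulatedValue2 DECIMAL(18, 4)
    -- ... further accumulated measures
);

CREATE TABLE fact (
    id              BIGINT PRIMARY KEY,
    statisticsDimId BIGINT NOT NULL REFERENCES statistics_dimension (id),
    dateRangeDimId  BIGINT NOT NULL REFERENCES date_range_dimension (id)
);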
We want to create our report data set with something like this:
SELECT *
FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = [start]
    AND date_range_dimension.endDateTime = [end]
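For example, binding the placeholders for the first two days of 2012 (dimension row 11 above):

SELECT *
FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = '2012-01-01 00:00'
    AND date_range_dimension.endDateTime = '2012-01-02 23:59'

With the sample data above, this would return fact rows 2, 10001 and 10002.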
The problem is that the data in our statistics dimension is already accumulated, and we cannot invert the accumulation. We calculated the approximate number of rows in our fact table and got 5,250,137,022,180. There are about 2.5 million date range permutations for our data, and because of the accumulation we would have to materialize all of them in our date dimension and fact table. SQL's SUM function does not help here: the accumulated values belong to overlapping (non-distinct) sets, so adding two of them double-counts their intersection.
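A made-up example of the double counting (the values are illustrative only); suppose accumulatedValue1 holds total page calls for one user:

    range 10 ('2012-01-01' to '2012-01-01'): accumulatedValue1 = 100
    range 11 ('2012-01-01' to '2012-01-02'): accumulatedValue1 = 150

The correct total for January 1-2 is simply 150, because the accumulation already contains January 1; SUM over both rows yields 250 and counts January 1 twice. The range explosion follows from simple combinatorics: with d distinct days there are d * (d + 1) / 2 possible whole-day ranges, so e.g. d = 2,236 days already produces 2,236 * 2,237 / 2 ≈ 2.5 million ranges.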
Is there a best practice we could follow to make it computationally feasible? Is there something wrong with our schema design?
We need to report data about online trainings. The data source is a legacy data provider with parts that are more than 10 years old, so nobody can reconstruct the internal logic. The statistics dimension contains, for example, the progress (in %) a user achieved in a web-based training (WBT), the number of calls per WBT page, the status of a WBT for a user (e.g. "completed"), and so on. The important thing about the data provider is that it only gives us a snapshot of the current state; we don't have access to historical data.