Our data warehouse builds a snowflake schema from a data source that delivers only accumulated data (there is no way to invert the accumulation). One requirement we have to meet is that the schema must support reports based on arbitrary date ranges.
Our schema looks like this (simplified):
+------------------------------------------+
| fact                                     |
+-------+-----------------+----------------+
| id    | statisticsDimId | dateRangeDimId |
+-------+-----------------+----------------+
| 1     | 1               | 10             |
| 2     | 2               | 11             |
| 3     | 3               | 12             |
| 4     | 4               | 13             |
| 5     | 5               | 14             |
| 6     | 5               | 15             |
| 7     | 5               | 16             |
| ...   | ...             | ...            |
| 10001 | 9908            | 11             |
| 10002 | 9909            | 11             |
+-------+-----------------+----------------+
+-------------------------------------------------+
| date_range_dimension                            |
+-------+-----------------------------------------+
| id    | startDateTime      | endDateTime        |
+-------+--------------------+--------------------+
| 10    | '2012-01-01 00:00' | '2012-01-01 23:59' |
| 11    | '2012-01-01 00:00' | '2012-01-02 23:59' |
| 12    | '2012-01-01 00:00' | '2012-01-03 23:59' |
| 13    | '2012-01-01 00:00' | '2012-01-04 23:59' |
| 14    | '2012-01-01 00:00' | '2012-01-05 23:59' |
| 15    | '2012-01-01 00:00' | '2012-01-06 23:59' |
| 16    | '2012-01-01 00:00' | '2012-01-07 23:59' |
| 17    | '2012-01-01 00:00' | '2012-01-08 23:59' |
| 18    | '2012-01-01 00:00' | '2012-01-09 23:59' |
| ...   | ...                | ...                |
+-------+--------------------+--------------------+
+-----------------------------------------------------+
| statistics_dimension                                |
+-------+-------------------+-------------------+-----+
| id    | accumulatedValue1 | accumulatedValue2 | ... |
+-------+-------------------+-------------------+-----+
| 1     | [not relevant]    | [not relevant]    | ... |
| 2     | [not relevant]    | [not relevant]    | ... |
| 3     | [not relevant]    | [not relevant]    | ... |
| 4     | [not relevant]    | [not relevant]    | ... |
| 5     | [not relevant]    | [not relevant]    | ... |
| 6     | [not relevant]    | [not relevant]    | ... |
| 7     | [not relevant]    | [not relevant]    | ... |
| ...   | [not relevant]    | [not relevant]    | ... |
| ...   | [not relevant]    | [not relevant]    | ... |
| 10001 | [not relevant]    | [not relevant]    | ... |
| 10002 | [not relevant]    | [not relevant]    | ... |
+-------+-------------------+-------------------+-----+
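In DDL terms, the simplified schema looks roughly like this (the concrete data types are placeholders, not our real ones):

-- Minimal sketch of the simplified schema; data types are assumptions.
CREATE TABLE date_range_dimension (
    id            BIGINT PRIMARY KEY,
    startDateTime DATETIME NOT NULL,
    endDateTime   DATETIME NOT NULL
);

CREATE TABLE statistics_dimension (
    id                BIGINT PRIMARY KEY,
    accumulatedValue1 DECIMAL(18, 4),
    accumulatedValue2 DECIMAL(18, 4)
    -- ... further accumulated measures
);

CREATE TABLE fact (
    id              BIGINT PRIMARY KEY,
    statisticsDimId BIGINT NOT NULL REFERENCES statistics_dimension (id),
    dateRangeDimId  BIGINT NOT NULL REFERENCES date_range_dimension (id)
);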
We want to create our report data set with something like this:
SELECT *
FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = [start]
    AND date_range_dimension.endDateTime = [end]
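For example, binding the placeholders for the first two days of 2012 (dimension row 11 above):

SELECT *
FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = '2012-01-01 00:00'
    AND date_range_dimension.endDateTime = '2012-01-02 23:59'

With the sample data above, this would return fact rows 2, 10001 and 10002.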
The problem is that the data in our statistics dimension is already accumulated, and we cannot invert the accumulation. We calculated the approximate number of rows in our fact table and got 5,250,137,022,180. There are about 2.5 million date range permutations for our data, and because of the accumulation we would have to materialize all of them in our date dimension and fact table. SQL's SUM function does not help here: the accumulated values belong to overlapping (non-distinct) sets, so adding two of them double-counts their intersection.
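A made-up example of the double counting (the values are illustrative only); suppose accumulatedValue1 holds total page calls for one user:

    range 10 ('2012-01-01' to '2012-01-01'): accumulatedValue1 = 100
    range 11 ('2012-01-01' to '2012-01-02'): accumulatedValue1 = 150

The correct total for January 1-2 is simply 150, because the accumulation already contains January 1; SUM over both rows yields 250 and counts January 1 twice. The range explosion follows from simple combinatorics: with d distinct days there are d * (d + 1) / 2 possible whole-day ranges, so e.g. d = 2,236 days already produces 2,236 * 2,237 / 2 ≈ 2.5 million ranges.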
Is there a best practice we could follow to make it computationally feasible? Is there something wrong with our schema design?
We need to report data about online trainings. The data source is a legacy data provider with parts that are more than 10 years old, so nobody can reconstruct the internal logic. The statistics dimension contains, for example, the progress (in %) a user achieved in a web-based training (WBT), the number of calls per WBT page, the status of a WBT for a user (e.g. "completed"), and so on. The important thing about the data provider is that it only gives us a snapshot of the current state; we don't have access to historical data.