sum and distinct-count measures (star schema design koan)

Question

I am quite a beginner in Data Warehouse Design. I have red some theory, but recently met a practical problem with a design of a OLAP cube. I use star schema.
Lets say I have 2 dimension tables and 1 fact table:

Dimension Gazetteer:
dimension_id
country_name
province_name
district_name

Dimension Device:
dimension_id
device_category
device_subcategory

Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)

A "business object" (which is a mine field actually) can have multiple devices, is located in a single location (Gazetteer) and ocuppies X square meters.
So in order to know which device categories there are, I created a fact per each device in hazard like this:

+--------------+---------------------+-----------------------+-----------+
| gazetteer_id | device_dimension_id | hazard_id             | area_m2   |
+--------------+---------------------+-----------------------+-----------+
| 123          | 321                 | 0a0a-502c-11aa1331e98 | 6000      |
+--------------+---------------------+-----------------------+-----------+
| 123          | 654                 | 0a0a-502c-11aa1331e98 | 6000      |
+--------------+---------------------+-----------------------+-----------+
| 123          | 987                 | 0a0a-502c-11aa1331e98 | 6000      |
+--------------+---------------------+-----------------------+-----------+

I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the dimension gazetteer and device and know how many hazards there are with given dimension members.
But the problem is the area_m2: because it is defined as a sum, it gives a value n-times higher than the actual area, where n is th number of devices of the hazard object. For example, with the data above would give 18000m2.
How would you solve this problem?

I am using the Pentaho stack.

Thanks in advance

If a hazard-id is a minefield, and you're looking at mines-per-gazetter & size-of-minefields-by-gazetteer, maybe you could make a Hazard dimension, which holds the area of the Hazard; or possibly make a Null-device entry in the DeviceDimension table, and only the Null-device entry gets the area_m2 set, the real devices get area_m2=0. — Jamie, Apr 23 '12 at 23:57

score 1 · Answer 1 · answered Apr 24 '12 at 01:22

[moved from comment]

If a hazard-id is a minefield, and you're looking at mines-by-region(gazetter) & size-of-minefields-by-gazetteer, maybe you could make a Hazard dimension, which holds the area of the Hazard; or possibly make a Null-device entry in the DeviceDimension table, and only the Null-device entry gets the area_m2 set, the real devices get area_m2=0.

If you need to answer queries like: total area of minefields containing device 321, the second approach isn't going to easily answer these questions, which suggests that making a Hazard dimension might be a better approach.

I would also consider adding a device-count fact, which could have the num devices of each type per hazard.

Thanks Jamie for response. Option (1) Make a Hazard Dimension: I believe it would be difficult to use because there would be many single entries for Area (dozens of thousands) without hierarchy. Option (2) with null device is a good idea, but you're right with your example, I don't know how to implement that, at least in pentaho. The option (3) addicional number-of-devices-per-field-count column in the fact table: I was thinking of it. It could serve as back-divide the total area but won't work with a query where multiple devices requested... — elkarel, Apr 25 '12 at 12:34

sum and distinct-count measures (star schema design koan)

1 Answers1