Star Schema and Weighted Averages/Summarised Data

Question

What is the appropriate way to handle data that is, by virtue of its source, rolled-up or otherwise summarised and weighted ahead of time (min, max, avg for the summarisation)?

I cannot change that behaviour, so while "Do not summarise the data" is unfortunately not going to fly.

The data would look like this:

Data that is less than a week old: every 5 minutes, weight = 5
Data that is between a week and a month old: summarised into hourly data, weight = 60
Data that is older than that: summarised into daily data, weight = 60

table, th, td {
  border: 1px solid;
}

td {
  font-family: monospace;
}

<table>
    <tr>
        <th>timestamp
        <th>min
        <th>max
        <th>avg
        <th>weight
     <tr>
        <td>2022-05-18 00:10:00
        <td>1
        <td>10
        <td>3
        <td>5
     <tr>
        <td>2022-05-18 00:05:00
        <td>2
        <td>6
        <td>4
        <td>5
    <tr>
        <td>2022-05-18 00:00:00
        <td>9
        <td>93
        <td>44
        <td>5
    <tr>
        <td>...
    <tr>
        <td>2022-05-11 02:00:00
        <td>1
        <td>1
        <td>1
        <td>60
    <tr>
        <td>2022-05-11 01:00:00
        <td>7
        <td>94
        <td>23
        <td>60
    <tr>
        <td>2022-05-11 00:00:00
        <td>0
        <td>0
        <td>0
        <td>60
    <tr>
        <td>...
    <tr>
        <td>2022-04-18 00:00:00
        <td>1
        <td>100
        <td>50
        <td>1440
    <tr>
        <td>2022-04-17 00:00:00
        <td>10
        <td>10
        <td>10
        <td>1440
    <tr>
        <td>2022-04-16 00:00:00
        <td>0
        <td>15
        <td>10
        <td>1440

Everything says "do not mix granularity" but then has a caveat of "you could use a field indicating weight". Ought one store the data for the different timeframes in a single fact table with the weights, or in multiple fact tables?

The questions that will be asked will tend to cross the boundaries of the different bands in terms of timeframe.

score 0 · Answer 1 · answered May 18 '22 at 10:41

"do not mix granularity" applies. If facts are not at the same grain they cannot be in the same fact table and therefore you would need to create multiple fact tables.

You would also need to consider how to handle the "same" data that is sourced at different levels of aggregation. Say you had store data at the day grain from some stores and the same data, but at the week grain, from other stores. In order to be able to report across all stores (and this could not be more granular than the week grain) you would need to aggregate the day grain data up to the week level and then you have a choice:

Merge this data into the same table that holds the data being sourced at the week grain
Hold this aggregated data in a separate table and, potentially, put a view across both this aggregated data and the "sourced at week grain" table to present them as a single dataset

My comment seems to have disappeared. Thanks for the this! Can you speak to the implications of the weight + fact table all in one approach? Is it mainly a matter of performance/convenience (due to having to write our own measures for things like count and average, or standard deviations etc), gappy data, will the tools make weird joins, or are there other deeper implications that one would not be immediately aware of? — localusername, May 19 '22 at 11:13
One of the major benefits of a properly designed dimensional model is that is is simple, and unambiguous, to query - this means the complicated logic is built into the data loading processes and not into the data querying processes. — NickW, May 19 '22 at 15:59

Star Schema and Weighted Averages/Summarised Data

1 Answers1