1

One of our data sources sends a feed with an aggregate of data per day. A periodic snapshot. For example:

shop,       day,        sales
bobs socks, 2019-01-01, 45,
bobs socks, 2019-01-02, 50,
bobs socks, 2019-01-03, 10,
janes coats,2019-01-01, 500,
janes coats,2019-01-02, 55,
janes coats,2019-01-03, 100

I know of two ways to model this in a data vault raw vault:

Multi-Active Satellite

Here we allow each satellite to have multiple rows per hub key.

create table dbo.HubShop (
    ShopName nvarchar(50) not null,
    primary key pk_HubShop (ShopName)
)

create table dbo.SatDailyShopSales (
    ShopName nvarchar(50) not null,
    SalesDate date not null,
    SalesAmount money not null,
    LoadTimestamp datetime2(7) not null,

    primary key pk_SatDailyShopSales (ShopName, SalesDate, LoadTimestamp)
)

This is easy to implement but we now have a bi-temporal element to the satellite.

Snapshot Hub

create table dbo.HubShop (
    ShopName nvarchar(50) not null,
    primary key pk_HubShop (ShopName)
)

create table dbo.HubSnapshot (
    SalesDate date not null,
    primary key pk_HubSnapshot  (SalesDate)
)

create table dbo.LinkDailyShopSnapshot (
    LinkDailyShopSnapshotHash binary(32) not null,
    ShopName nvarchar(50) not null,
    SalesDate date not null,

    primary key pk_LinkDailyShopSnapshot  (LinkDailyShopSnapshotHash)
)

create table dbo.SatDailyShopSales (
    LinkDailyShopSnapshotHash binary(32) not null,

    SalesAmount money not null,
    LoadTimestamp datetime2(7) not null,

    primary key pk_SatDailyShopSales (LinkDailyShopSnapshotHash, LoadTimestamp)
)

This second solution adds an extra hub which just stores a list of dates and a link for the intersection between date and shop.

The second solution feels cleaner but requires more joins.

Which is the correct model? Are there any better solutions?

James
  • 433
  • 4
  • 14
  • Not the second solution. Data vault is about business keys, Snapshot is not one of them, it's only how the relations between hubs are captured. – imel96 Dec 01 '21 at 05:45

1 Answers1

2

as far as my understanding of the Data Vault modelling approach goes the Satellites are there to store the accurate time-slices of your data-warehouse. This means that if i am given a specific date and i select all hubs, links (with no or enddate <= specific date). And then their corresponding entry with max(loaddate) & loaddate <= specific date, i should have the full representation of the current real world data state.

Applied to your question this means that your second solution fits these requirements. Because you can still import "changes" in the source system as new time slices, therefore modeling the correct timeline of information in the dwh.

To formulate it as an example, lets say you source system has the state:

shop,       day,        sales
bobs socks, 2019-01-01, 45,
bobs socks, 2019-01-02, 50,
bobs socks, 2019-01-03, 10,
janes coats,2019-01-01, 500,
janes coats,2019-01-02, 55,
janes coats,2019-01-03, 100

and you import this data on 2019-01-03 23:30:00. On Jannuary the 4th 12:10:00 though "janes couts" salesteam corrects the numbers to only 90 sales. In your first solution this leaves you with updating the satellite entry with hub key "janes coats" and loaddate "2019-01-03" to 90 effectively loosing your accurate dwh history.

so your DWH only stores the following afterwards:

shop,       day,        sales
bobs socks, 2019-01-01, 45,
bobs socks, 2019-01-02, 50,
bobs socks, 2019-01-03, 10,
janes coats,2019-01-01, 500,
janes coats,2019-01-02, 55,
janes coats,2019-01-03, 90

whereas in your second solution you simply insert a new satellite timeslice for store snapshot hash (for business key "janes coats" with date"2019-01-03") with loaddate "2019-01-03 12:10:00" and sales 90.

LINK
shop,       day,        ID (think of ID as a hash)
bobs socks, 2019-01-01, 1
bobs socks, 2019-01-02, 2
bobs socks, 2019-01-03, 3
janes coats,2019-01-01, 4
janes coats,2019-01-02, 5
janes coats,2019-01-03, 6

SALES Satellite
Link ID, loaddate,            sales
1,       2019-01-03 23:30:00, 45
2,       2019-01-03 23:30:00, 50
3,       2019-01-03 23:30:00, 10
4,       2019-01-03 23:30:00, 500
5,       2019-01-03 23:30:00, 55
6,       2019-01-03 23:30:00, 100   !
6,       2019-01-04 12:10:00, 90    !

So you can easily see in your system that you got the correction of sales numbers at 2019-01-04 12:10:00 and that they were 100 before that.

The way I think of it is the only allowed update action in the Data Vault model is setting an EndDate in a Link Table and that deletes are never allowed. The you have a full DWH history available and reproduceable.

FlorianB
  • 116
  • 1
  • 4