
Suppose my company sells many different products, e.g. domains, t-shirts and vans.

In my current data warehouse star schema design, I have a fact table for invoice items with the following (slightly simplified) schema:

fact_invoice_item

id                   | pk
invoice_item_id      | id of invoice item in OLTP
dim_customer_key     | fk to customer dimension
dim_product_key      | fk to product dimension
dim_billing_date_key | fk to date dimension
dim_due_date_key     | fk to date dimension
invoice_amount       | fact
item_amount          | fact
dd_invoice_id        | degenerate dimension to group together invoice items on the same invoice

I would now like to start recording metadata about these invoice items. For example, if a domain was bought, what was the domain name? If a van was bought, what was the license plate number? If a t-shirt was bought, what was the colour? What is the best way to achieve this while (ideally) adhering to a star / constellation schema?

Current thinking:

Option 1

Have one generic invoice_item_metadata dimension table, with a fk to it from the fact table. This dimension table could store the item metadata in JSON form, or I could even just store the purchase metadata in the fact table itself as JSON. This would make things a bit tricky, as I'd then need to unpack the JSON to run any analysis on it.
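Roughly what I have in mind (PostgreSQL-flavoured and purely a sketch; the metadata fk on the fact table and all names here are illustrative, not part of the current schema):

    -- Option 1 sketch: one generic metadata dimension holding a JSON payload
    CREATE TABLE dim_invoice_item_metadata (
        dim_invoice_item_metadata_key  bigint PRIMARY KEY,
        metadata                       jsonb
    );

    -- any analysis then has to unpack the JSON, e.g. counting t-shirt sales by colour
    SELECT m.metadata ->> 'colour' AS colour,
           COUNT(*)                AS items_sold
    FROM   fact_invoice_item f
    JOIN   dim_invoice_item_metadata m
           ON m.dim_invoice_item_metadata_key = f.dim_invoice_item_metadata_key  -- hypothetical fk
    GROUP  BY m.metadata ->> 'colour';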

Option 2

Have a fact table for each type of product bought, e.g. fact_domain_purchase and fact_van_purchase. These fact tables could have their own structure to best suit the product metadata. This seems logical, but then I start to think that a domain is more of a slowly changing dimension (SCD), as it could have attributes like suspended/active/expired which are liable to change over time. This leads me to think that I could have a fact_domain_purchase table with a fk to a dim_domain table, but then the dim_domain table would grow at the same rate as the fact_domain_purchase table, which is undesirable.
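Roughly what Option 2 would look like (again PostgreSQL-flavoured and purely illustrative; the column names and types are assumptions):

    -- Option 2 sketch: a per-product fact table plus a type-2 domain dimension
    CREATE TABLE dim_domain (
        dim_domain_key  bigint PRIMARY KEY,
        domain_name     varchar(255),
        status          varchar(20),   -- active / suspended / expired, liable to change
        valid_from      date,
        valid_to        date           -- SCD type-2 versioning columns
    );

    CREATE TABLE fact_domain_purchase (
        id                    bigint PRIMARY KEY,
        dim_domain_key        bigint REFERENCES dim_domain (dim_domain_key),
        dim_customer_key      bigint,
        dim_billing_date_key  int,
        item_amount           numeric(12,2),
        dd_invoice_id         bigint
    );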

Does anyone have any bright ideas on how to handle this situation? I'm sure I can't be the first person to tackle this problem, but I've found it quite tough to get anything useful back from Google. Thanks in advance for any help.

dbatten

2 Answers


The item properties are ideally described in the product dimension, i.e. you have a separate product_id for each combination of T-shirt size and colour.

If you deal with more unique items (with attributes not completely covered in the product dimension), you add those missing attributes to the fact table.

Either in the one main fact table, which then contains attributes for all product subtypes (t-shirts, vans, ...), but only those for the sold subtype are populated and all the others are NULL.
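For illustration only (the column names here are assumptions), that variant could look like this:

    -- the single wide fact table variant: subtype-specific attributes live directly on the fact;
    -- only the columns for the sold subtype are filled, the rest stay NULL
    ALTER TABLE fact_invoice_item
        ADD COLUMN domain_name    varchar(255),   -- filled only for domain purchases
        ADD COLUMN license_plate  varchar(20),    -- filled only for van purchases
        ADD COLUMN tshirt_colour  varchar(30);    -- filled only for t-shirt purchases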

Alternatively (if your landscape is highly heterogeneous), you define a separate fact table for each subtype, related to the main fact table by an optional 1:1 relation. No general rule is valid here; the only possibility is to prototype your solution and see what works and what fails.

What I definitely would not recommend is using the surrogate keys (e.g. dim_billing_date_key) for the time dimension instead of native DATE columns (still referencing the time dimension). The fact table is typically partitioned on the time dimension, and the surrogate key may disturb partition pruning, which is a key to query performance; test this carefully in the prototype as well.
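For illustration (PostgreSQL-style declarative partitioning, structure heavily simplified; the table name is invented so as not to clash with the real one):

    -- keep a native DATE column on the fact and partition on it
    CREATE TABLE fact_invoice_item_proto (
        id               bigint,
        dim_product_key  bigint,
        billing_date     date NOT NULL,
        item_amount      numeric(12,2)
    ) PARTITION BY RANGE (billing_date);

    CREATE TABLE fact_invoice_item_2019_01
        PARTITION OF fact_invoice_item_proto
        FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');

    -- this predicate can be pruned to the January 2019 partition,
    -- with no join to the date dimension needed
    SELECT SUM(item_amount)
    FROM   fact_invoice_item_proto
    WHERE  billing_date >= DATE '2019-01-01'
    AND    billing_date <  DATE '2019-02-01';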

Marmite Bomber
  • hi @marmite-bomber thanks for your answer. I appreciate that I should be storing as much as I can in the product dimension. maybe t-shirt colour was a bad example. re: your last paragraph, in the actual schema I do have datetime columns for the actual billing_date and due_date, is that what you are referring to? or am I missing something else? I haven't come across partition pruning before as a concept – dbatten Apr 16 '19 at 21:49
  • The point is: you typically need a join to the product dimension to constrain all fact records with a *red T-shirt*, but IMO you should not need a join to the time dimension to select all fact records of a given month. In other words, you should be able to constrain your fact table e.g. with `WHERE billing_date >= DATE'2019-01-01'` – Marmite Bomber Apr 17 '19 at 04:59

I think you need to tackle two things: where and how to store your metadata.

For where to store it, your use case is a perfect example for an extension table:

fact_invoice_item_ext

id                   | pk
fact_invoice_item_id | fk to the fact_invoice_item row

This table is similar to your fact table, but it is basically an extension of your fact records that stores additional (and possibly optional) data.

For how to store it, since you don't know the characteristics of the data in advance, you can store it as key-value pairs in this table. That would be easier to access and more manageable than JSON, so these additional columns will hold the key-value pairs:

attr_key             | attribute key; domain, van, t-shirt etc.
attr_value           | attribute value; domain name, license plate etc.

With this approach you can have multiple additional attributes (metadata) for invoice items.
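Illustrative DDL for the idea above (PostgreSQL-flavoured; the column types and example keys are assumptions), together with an example of reading the metadata back:

    -- extension table holding one key-value pair per row of additional metadata
    CREATE TABLE fact_invoice_item_ext (
        id                    bigint PRIMARY KEY,
        fact_invoice_item_id  bigint NOT NULL,   -- points back to fact_invoice_item.id
        attr_key              varchar(50),       -- e.g. 'domain_name', 'license_plate', 'colour'
        attr_value            varchar(255)
    );

    -- e.g. list the domain name recorded against each invoice
    SELECT f.dd_invoice_id, e.attr_value AS domain_name
    FROM   fact_invoice_item     f
    JOIN   fact_invoice_item_ext e ON e.fact_invoice_item_id = f.id
    WHERE  e.attr_key = 'domain_name';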

Please let me know if this makes sense or you have additional questions around this concept

demircioglu
  • hi @mdem7 thanks for your answer. I hadn't come across extension tables as a data warehouse concept before; that certainly seems to fit the problem quite well. Would this be your recommendation in this particular scenario? Or is it quite a widely used solution to this sort of problem? – dbatten Apr 16 '19 at 21:53
  • Extension tables are widely used in Oracle OBIEE (formerly known as Siebel). When there is a use case that fits, they are another tool to solve the problem at hand. I have used them in various data warehouse models before. I would recommend them in this particular scenario because it looks like they'll provide a solution – demircioglu Apr 16 '19 at 22:29
  • I'd be careful about using a *key-value* schema, especially in data warehouse fact tables. This is a good workaround for *unknown future extensions*, but in a *relational database* it has its performance cost. *Prototype and test performance* before deciding. – Marmite Bomber Apr 17 '19 at 05:03
  • I partially agree. I have a 9B row table with key-value columns in Redshift, and 90% of the queries run under 30 seconds. I guess it depends on your DB and how the tables are architected. – demircioglu Apr 17 '19 at 16:26
  • Oh, I didn't realize the question is *not* tagged with `Oracle`; of course in NoSQL databases key-value modelling is fine. My comment is relevant for RDBMS only. – Marmite Bomber Apr 17 '19 at 20:33