1

I have a flatfile resources that were extracted into facts and dimensions. Some dimensions also comes from db resources. The transformation process is set on as needed basis (if there are new/updated from flatfiles). The problem is this, some data reference doesn't exist or match on the dimension based on db resources so the foreign key id value on the fact is set to default (zero if no matching data).

How can i perform an update on the facts if the said dimension (db resource) has been updated? What was the best practice/routine for this kind of scenario?

This is the sample illustration

Flatfile source                           product list (db source)
--------------------------------          ------------------------------
| product name | year | volume |          | prodcode |  name           |
--------------------------------          ------------------------------
| apple        | 2020 |  1000  |          | 001      | apple           |
| watermelon   | 2020 |  2000  |          | 002      | mango           |
--------------------------------          ------------------------------

Fact/Dimension

production_fact                           dim_product
-------------------------------          ---------------------------
| fk_product| fk_date| volume |          | id | prodcode |  name   |
-------------------------------          --------------------------|
| 2         |  d001  |  1000  |          |  1 |  n/a      | n/a    |
| 1         |  d001  |  2000  |          |  2 |  001      | apple  |
-------------------------------          |  3 |  002      | mango  |
                                         ---------------------------

If the product list will be updated (003 watermelon), should i replace the dim_product row#1 with the new value?

Nick.Mc
  • 18,304
  • 6
  • 61
  • 91
zysirhc
  • 37
  • 7
  • If a fact arrives that has no matching dimension, you could create a dimension key using the source keys with empty attributes. Then you just update the dimension when it actually arrives. – Nick.Mc May 21 '21 at 10:11
  • @Nick.McDermaid do you have any sample similar to my scenario? i'm trying to figure when will be the trigger part to update and take effect on the fact tables. do i need to re-ingest again the fact data after the update of the dimension? – zysirhc May 21 '21 at 12:43
  • No you don't update the fact. You save a fact with a legitimate dimension record (not a zero). The dimension records happens to have all blank attributes. When the actual dimension appears, it updates the dimension record – Nick.Mc May 21 '21 at 12:44
  • I keep reading your explanation and I can't quite follow it. Perhaps it would be easier if you explained in dot form exactly what happens and in what order and the state of fact and dimension tables – Nick.Mc May 21 '21 at 12:48
  • The source data for both fact and dimension have a common source key right? When you get a fact record with a source key that doesn't already exist in the dimension, you create a dimension record (which of course preserves that source key!). But the dimension has no attributes because it hasn't actually arrived yet. Then you use that surrogate key in the fact.. pointing at a dimension with no attributes filled in. When the dimension does arrive, you simply update the dimension with the correct attributes, because you can match on source keys – Nick.Mc May 21 '21 at 13:08

1 Answers1

1

Based on your example, this is the way it should work:

Note: I would expect prodcode to be be in flatfile, not product name. Is this really how your data looks? Anyway I will proceed.

First set of data arrives. Watermelon is in fact but not dimension.

Flatfile source                           product list (db source)
--------------------------------          ------------------------------
| product name | year | volume |          | prodcode |  name           |
--------------------------------          ------------------------------
| apple        | 2020 |  1000  |          | 001      | apple           |
| watermelon   | 2020 |  2000  |          | 002      | mango           |
--------------------------------          ------------------------------

We load a dimension record but it won't have any attribute values. (As I said I would normally expect the code to be in the fact input data but that's fine we'll go with description). This will of course require some logic to find dimensions that are in fact but not in dimensions.

production_fact                           dim_product
-------------------------------      ------------------------------------------------
| fk_product| fk_date| volume |      | id | prodcode |  name       | weight |colour |
-------------------------------      ------------------------------------------------
| 2         |  d001  |  1000  |      |  1 |  n/a      | n/a        | n/a    | n/a   |
| 4         |  d001  |  2000  |      |  2 |  001      | apple      | 200mg  | red   |
-------------------------------      |  3 |  002      | mango      | 400mg  | yellow|
                                     |  4 |  ?        | watermelon | ?      |   ?   |
                                     ------------------------------------------------

So we have dimension SK 4 which is a legitimate dimension record except it's missing a load of attributes.

Later, the dimension arrives. We know what it's meant to match on so we update the existing dimension which was missing data.

 product list (db source)
-----------------------------------------------
| prodcode |  name           | weight |colour |
--------------------------------------|-------|
| 003      | watermelon      | 1kg    | green |
-----------------------------------------------


------------------------------------------------
| id | prodcode |  name       | weight |colour |
------------------------------------------------
|  1 |  n/a      | n/a        | n/a    | n/a   |
|  2 |  001      | apple      | 200mg  | red   |
|  3 |  002      | mango      | 400mg  | yellow|
|  4 |  003      | watermelon | 1kg    | green |
------------------------------------------------

You want to avoid ever updating large facts. Updating smaller dimensions is a much better idea

BTW this is a type 1 dimension. You can take the same appriach with a SCD except that you wouldn't count the first version of the dimension, you'd just overwrite it.

Nick.Mc
  • 18,304
  • 6
  • 61
  • 91
  • Thanks for the illustration sir nick. As per the system owner, the product list is being managed by their in-house application and they define it as the source of truth for product listing. The flatfiles were submitted by their field reporters that is why the product code was not used. So my approach is that more on the conversion and mapping. I only have query scripts for dimension translation and fact building. I haven't build yet for late arriving dimension and facts (possible to happen). – zysirhc May 22 '21 at 12:58
  • 1
    OK that's fine... if you don't have a code that's all there is to it. But hopefully you get the idea from my answer. You create the dimension and join to it, then update the dimension afterwards. Don't update the fact – Nick.Mc May 22 '21 at 13:03
  • Yes sir. i get it. i'll reformulate my query scripts in order to accommodate this scenario. i might as well add a flag to the dimension if it is valid or not. It is also a domino effect on my dimension builder as it is focus on the data construction from db sources. – zysirhc May 22 '21 at 13:23
  • 1
    I agree, a flag is a good idea to indicate this is a known late arriving dim. Good luck! – Nick.Mc May 22 '21 at 13:24