
Our source system data is not a typical sales transaction table that references product and customer business keys on the transaction record. There are one or two tables in between before I can get to the customer or product information. When I load data into staging tables in the DW, I plan to prepare the data in the data lake using Spark (by joining the transaction table through all the tables in between) so that each record carries the customer and product business keys. Is this the right approach? I don't want to perform these joins on the staging tables; instead, I want to load already-prepared data with all the context (the associated business keys). I will then replace these business keys with surrogate keys when loading into the target fact table.
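For illustration, here is a minimal PySpark sketch of the prep step described above. The table names, lake paths, and key columns (sales_txn, order_header, order_line, customer_bk, product_bk, and so on) are hypothetical stand-ins for the one or two bridging tables; the surrogate-key substitution still happens later, at fact-load time.

```python
# Hypothetical prep job: attach customer/product business keys to each
# transaction before it is loaded into the DW staging table. All names and
# paths below are placeholders for the actual source tables in the data lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

txn    = spark.read.format("delta").load("/lake/raw/sales_txn")
header = spark.read.format("delta").load("/lake/raw/order_header")  # bridges txn -> customer
line   = spark.read.format("delta").load("/lake/raw/order_line")    # bridges txn -> product

prepared = (
    txn
    .join(header.select("order_id", "customer_bk"), on="order_id", how="left")
    .join(line.select("order_line_id", "product_bk"), on="order_line_id", how="left")
    .select("txn_id", "txn_date", "amount", "customer_bk", "product_bk")
)

# Land the prepared rows in the "prep" zone of the lake; a later step uploads
# them to the DW staging table, where business keys are swapped for surrogates.
prepared.write.format("delta").mode("overwrite").save("/lake/prep/sales_txn_prepared")
```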

  • Generally I would recommend not transforming the source - instead replicate it and do the transformation after that. Otherwise it gets really difficult to troubleshoot transformation issues. If you have _all_ of your workings in work tables you can follow them through and find issues. Also, if you pre-transform the data you need to come up with an intermediate data model, along with the maintenance it requires, i.e. do you join lookups to transactions? Do you include keys with lookup labels? If you miss a required field you need to redo all of your intermediate transformations. – Nick.Mc Aug 05 '21 at 08:36
  • ... and Databricks as a transformation tool for a DW - I don't really buy it. I don't see what Databricks offers if all of your sources and targets are relational databases. – Nick.Mc Aug 05 '21 at 08:37
  • Could you explain what you mean by work tables? Are they staging tables within the DW? –  Aug 05 '21 at 08:56
  • I replicate the source data in the Data Lake. Is it okay to transform once we have a copy of the raw data in the Data Lake? "If you miss a required field you need to do all of your intermediate transformations again" - do you mean for early-arriving facts? I am using Databricks to prepare data in the data lake before it is uploaded to staging tables in the DW. –  Aug 05 '21 at 09:00
  • My typical data flow = RDS --> Data Lake (Raw) --> Spark Job --> Data Lake (Prep) --> Upload to Staging Tables in DW --> Update target fact and dimension tables using staging tables (the prep-to-staging hop is sketched after this thread). –  Aug 05 '21 at 09:03
  • @Nick.McDermaid could you let me know your thoughts? –  Aug 05 '21 at 09:10
  • Work tables / staging tables - yes, the same thing. RDS = Amazon Relational Database Service? I see this all the time: what value are the data lake + Spark adding here? They add complexity, but is there actually a need for them? Why not just load straight from RDS into staging tables and do the work in the database? Going through a non-relational layer just adds complexity in my opinion... you lose data types etc. – Nick.Mc Aug 05 '21 at 12:40
  • Reasons for the data lake: 1. We are not pulling data in batches from the DB; we are using the WAL to load data incrementally into the Data Lake to reduce the load on the DB. 2. It also gives us the ability to store, at low cost, data from all the tables we don't use in the DW but that the ML models read directly from the data lake. We want to separate storage from compute as much as possible. It makes sense to move data from RDS to the DW directly, but w.r.t. the other clickstream data (coming from other sources, it looks like a log of JSON), we don't want to load the entire data set into the DW, so we want to.. –  Aug 05 '21 at 16:00
  • ..use this as a buffer and only take relevant data into the DW. Let me know if there is anything wrong with this approach. I think you made a good point w.r.t. losing data types; I had never thought about it. RDS is the Relational Database Service @Nick.McDermaid –  Aug 05 '21 at 16:00
  • That is definitely a bunch of good reasons to use a data lake. Anyway, I think this question is probably opinion based. All I can say is: make sure you can audit/troubleshoot the data flow easily. If you can do that then your approach is probably fine. – Nick.Mc Aug 05 '21 at 23:38
  • Thank you so much for the feedback! @Nick.McDermaid –  Aug 06 '21 at 04:11
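To make the flow discussed in the comments concrete, here is a minimal sketch of the Data Lake (Prep) --> Staging Tables hop, assuming the warehouse is reachable over JDBC. The URL, schema, table names, and credential handling are placeholders, not the actual setup.

```python
# Hypothetical upload job: push the prepared (business-key-enriched) rows from
# the lake's prep zone into a DW staging table over JDBC. Connection details
# and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prepared = spark.read.format("delta").load("/lake/prep/sales_txn_prepared")

(prepared.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/dw")  # placeholder DW endpoint
    .option("dbtable", "staging.sales_txn_prepared")
    .option("user", "etl_loader")                        # inject real credentials securely
    .option("password", "********")
    .mode("overwrite")                                   # truncate-and-load the staging table
    .save())

# The fact load then joins the staging table to the dimension tables on the
# business keys to pick up surrogate keys, typically inside the DW itself.
```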

0 Answers