0

I have an ETL process I'm building and I'm trying to figure out the best way to build it. For context, i'm working with retail orders related data at a distribution center, so these orders start with an "open" status, meaning orders that need to be fulfilled, "WIP" meaning orders in process and then "shipped" meaning the order is complete and on a truck out of the DC.

I'm building an ETL process to create a Fact table that will have all order related information. The issue I'm having is that the primary key between the two tables (Table A, Table B) I'm trying to join changes once the order is "shipped". Example, Table A is the main transaction table, and Table B is the "task" information table. So once the order is "shipped" in table A, the primary key of Table B changes (it is called a sequence number, some WMS system update causes the sequence number to change) and then I can't join table B anymore. So basically, I can only track the more detailed Table B (Tasks) related information along side the rest of the transaction information as long as the order hasn't shipped. I can't figure out how best to go about this. Do I create a process that doesn't update the primary key in my staging tables? if so, how I would I do that? I'm just kind of stuck and am looking for some advice/guidance on this. Appreciate any help!

I tried making a copy of the column with the idea being that I'll prevent it from updating, but the issue is that the next day when I pull the data from the DW, the primary key's stored there are the updated ones, so basically I don't get to have more than 1 day of visibility to the joined data. I'm thought of many ways but nothing's working out.

enter image description here

Amira Bedhiafi
  • 8,088
  • 6
  • 24
  • 60
  • The question is great but hard to follow. The current db schema will be very useful. – Kate Melnykova May 19 '23 at 16:02
  • i added a link to a picture that shows the 2 tables and the primary key's I mentioned. Basically, the "Carton Seq" column in the processing table is the one that changes. need to find a way to create my own. – Umair Ahmed May 19 '23 at 16:30
  • Are you sure that table B is the actual table to use for this? Perhaps the ERD of the OLTP system would be useful. Not sure why the architect would make a PK an updated sequential id from a system perspective.. Also, one of the reasons to create a DWH is to address the issue of source systems lacking a historical / time series component. – Jatin Morar May 19 '23 at 18:05
  • i know, thats the issue. the data engineers in our company tend to just replicate the data from sources and put into the DW, but not actually create proper data structures. Thats why i'm building out a datamart to address that, but this is one of the major issues i'm encountering. – Umair Ahmed May 19 '23 at 19:31
  • Why not create a Type 2 SCD off Table B displaying the change in order status. Though if you go this route your process will be subject to daily run failures, therefore this will need some solid event/error handling. – Jatin Morar May 19 '23 at 19:56
  • Also, to circle up with my initial comment, I do not think `CARTON_SEQ` is a PK based on that image and what looks to be a DBeaver Diagram. Which again leads me to believe your understanding of the relational data model may not be inaccurate. My point is you may want to revisit your initial idea. Reverse engineering can be a fun challenge. – Jatin Morar May 19 '23 at 20:07
  • you're right, its not actually a PK. there were no PK's built in the first place by the data engineering teams. But that's the only unique column, which, even after changing remains unique. I've spoken with the SME's on this data and they told me yeah sorry you may not be able to get that data. So i'm trying to figure out my own way. – Umair Ahmed May 23 '23 at 16:51
  • I'll look into type 2 SCD, as that's something I've never done but it sounds like a possible solution. If I can track each change, I can manipulate data to get what I want. thanks for that option – Umair Ahmed May 23 '23 at 16:53

0 Answers0