I am looking to find the best way to update an existing delta table from a dataframe.
As an example, I have the following delta table:
Id | ROUTE_ID | ORIGIN | DEST | CARIER | RANK | CARGO_TYPE | TRANIST_T | FREQ | VALID_FROM | VALID_TO | SUBMISSION_DATE |
---|---|---|---|---|---|---|---|---|---|---|---|
25838 | 1 | CPH | DUB | SAS | 1 | G | 3 | 7 | 2022-10-01 | 2023-01-30 | 2020-03-13 |
25839 | 2 | CPH | DUB | SAS | 2 | G | 3 | 7 | 2022-10-01 | 2023-01-30 | 2020-03-13 |
and here is the new dataframe rows:
ORIGIN | DEST | CARIER | RANK | CARGO_TYPE | TRANIST_T | FREQ | VALID_FROM | VALID_TO | SUBMISSION_DATE |
---|---|---|---|---|---|---|---|---|---|
CPH | DUB | SAS | 1 | G | 3 | 7 | 2023-02-01 | 2023-06-30 | 2023-02-01 |
CPH | DUB | SAS | 2 | G | 3 | 7 | 2023-02-01 | 2023-06-30 | 2023-02-01 |
The logic is to update existing rows in the delta table based on the following, so the matching rows in the delta table are updated (VALID_TO column) or not depending if they meet the logic, else the row rows are added to the delta table.
def update(row,df,valid_from_max):
try:
# find matching rows from the delta table
result = df.where(((df['ORIGIN'] == row['ORIGIN']) & (df['DEST'] == row['DEST']) & (df['CARIER'] == row['CARIER']) & (df['CARGO_TYPE'] == row['CARGO_TYPE']) & (df['VALID_FROM'] == valid_from_max)
if result.count() == 0:
# if no match is found add the new row to the insert list
insert_list.append(row.to_frame().T)
else:
# set the date vars
new_valid_from = row['VALID_FROM']
new_valid_to = row['VALID_TO']
last_valid_from = result.agg({'VALID_FROM':'max'}).collect()[0][0]
last_valid_to = result.agg({'VALID_TO':'max'}).collect()[0][0]
# checks for invalid conditions
if new_valid_to <= new_valid_from:
print('To Date Less Than From Date')
elif new_valid_from < last_valid_from:
print('New Can Not Start Earlier')
elif (new_valid_from > last_valid_from) & (new_valid_to < last_valid_to):
print('New Can Not Contained Within Last')
# checks for valid conditions
elif new_valid_from >= last_valid_to:
print("newValidFrom >= lastValidTo")
insert_list.append(row.to_frame().T)
elif new_valid_from > last_valid_from:
print("newValidFrom > lastValidFrom")
update_date = new_valid_from - relativedelta(days=1)
result['VALID_TO'] = update_date
result = result.withColumn("VALID_TO", F.lit(update_date))
update_list.append(result)
insert_list.append(row.to_frame().T)
else:
print("Unknown Record Combination")
except Exception as e:
print(e)
So the end results would be something like the following:
Id | ROUTE_ID | ORIGIN | DEST | CARIER | RANK | CARGO_TYPE | TRANIST_T | FREQ | VALID_FROM | VALID_TO | SUBMISSION_DATE |
---|---|---|---|---|---|---|---|---|---|---|---|
25838 | 1 | CPH | DUB | SAS | 1 | G | 3 | 7 | 2022-10-01 | 2023-01-30 | 2020-03-13 |
25839 | 2 | CPH | DUB | SAS | 2 | G | 3 | 7 | 2022-10-01 | 2023-01-30 | 2020-03-13 |
0 | 1 | CPH | DUB | SAS | 1 | G | 4 | 7 | 2023-02-01 | 2023-06-30 | 2023-02-01 |
0 | 2 | CPH | DUB | EIR | 2 | G | 3 | 6 | 2023-02-01 | 2023-06-30 | 2023-02-01 |
So my initial thoughts is to have a temp insert and update list which are populated as part of the update function.
update_df = reduce(DataFrame.unionAll, update_list)
insert_df = pd.concat(insert_list)
update_df = update_df.drop_duplicates()
I would then apply merge and apply from the update_df and insert_df to the delta table.
Any suggestions or tips if I am going in the right direction or there is a better way to apply the updates to the delta table based on the needed logic
Cheers