0

I am looking to find the best way to update an existing delta table from a dataframe.

As an example, I have the following delta table:

Id ROUTE_ID ORIGIN DEST CARIER RANK CARGO_TYPE TRANIST_T FREQ VALID_FROM VALID_TO SUBMISSION_DATE
25838 1 CPH DUB SAS 1 G 3 7 2022-10-01 2023-01-30 2020-03-13
25839 2 CPH DUB SAS 2 G 3 7 2022-10-01 2023-01-30 2020-03-13

and here is the new dataframe rows:

ORIGIN DEST CARIER RANK CARGO_TYPE TRANIST_T FREQ VALID_FROM VALID_TO SUBMISSION_DATE
CPH DUB SAS 1 G 3 7 2023-02-01 2023-06-30 2023-02-01
CPH DUB SAS 2 G 3 7 2023-02-01 2023-06-30 2023-02-01

The logic is to update existing rows in the delta table based on the following, so the matching rows in the delta table are updated (VALID_TO column) or not depending if they meet the logic, else the row rows are added to the delta table.

def update(row,df,valid_from_max):

    try:
        # find matching rows from the delta table
        result = df.where(((df['ORIGIN'] == row['ORIGIN']) & (df['DEST'] == row['DEST']) & (df['CARIER'] == row['CARIER']) & (df['CARGO_TYPE'] == row['CARGO_TYPE']) & (df['VALID_FROM'] == valid_from_max)

        if result.count() == 0:
            # if no match is found add the new row to the insert list
            insert_list.append(row.to_frame().T)

        else:
            # set the date vars
            new_valid_from = row['VALID_FROM']
            new_valid_to = row['VALID_TO']
            last_valid_from = result.agg({'VALID_FROM':'max'}).collect()[0][0]
            last_valid_to = result.agg({'VALID_TO':'max'}).collect()[0][0]

            # checks for invalid conditions
            if new_valid_to <= new_valid_from:

                print('To Date Less Than From Date')

            elif new_valid_from < last_valid_from:

                print('New Can Not Start Earlier')

            elif (new_valid_from > last_valid_from) & (new_valid_to < last_valid_to):

                print('New Can Not Contained Within Last')

            # checks for valid conditions

            elif new_valid_from >= last_valid_to:
                print("newValidFrom >= lastValidTo")
        insert_list.append(row.to_frame().T)

            elif new_valid_from > last_valid_from:
                print("newValidFrom > lastValidFrom")
              
                update_date = new_valid_from - relativedelta(days=1)
                
                result['VALID_TO'] = update_date

                result = result.withColumn("VALID_TO", F.lit(update_date))
  
                update_list.append(result)
                insert_list.append(row.to_frame().T)

            else:
                print("Unknown Record Combination")

    except Exception as e:
          print(e)

So the end results would be something like the following:

Id ROUTE_ID ORIGIN DEST CARIER RANK CARGO_TYPE TRANIST_T FREQ VALID_FROM VALID_TO SUBMISSION_DATE
25838 1 CPH DUB SAS 1 G 3 7 2022-10-01 2023-01-30 2020-03-13
25839 2 CPH DUB SAS 2 G 3 7 2022-10-01 2023-01-30 2020-03-13
0 1 CPH DUB SAS 1 G 4 7 2023-02-01 2023-06-30 2023-02-01
0 2 CPH DUB EIR 2 G 3 6 2023-02-01 2023-06-30 2023-02-01

So my initial thoughts is to have a temp insert and update list which are populated as part of the update function.

update_df = reduce(DataFrame.unionAll, update_list)
insert_df = pd.concat(insert_list)
update_df = update_df.drop_duplicates()

I would then apply merge and apply from the update_df and insert_df to the delta table.

Any suggestions or tips if I am going in the right direction or there is a better way to apply the updates to the delta table based on the needed logic

Cheers

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
FizzyGood
  • 33
  • 8
  • 2
    Have you looked onto MERGE for Delta? https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge – Alex Ott Jan 15 '23 at 13:40
  • please bear with a novice here :) but as part of the merge operation i could potentially use the following to replicate the logic in the function? 1.Each whenMatched clause can have an optional condition. If this clause condition exists, the update or delete action is executed for any matching source-target row pair only when the clause condition is true. 2.If there are multiple whenMatched clauses, then they are evaluated in the order they are specified. All whenMatched clauses, except the last one, must have conditions. – FizzyGood Jan 15 '23 at 14:48
  • 2
    yes. you're completely correct in description how it works – Alex Ott Jan 15 '23 at 15:18

0 Answers0