
I have a PySpark notebook that reads data from a SQL table and merges the changes into the final layer. It works fine when it is triggered by a single user, but it throws the error "Concurrent Update on the STG table has failed" when multiple users trigger the notebook at the same time. Below is the PySpark code:

    DF = StageData()  # fetch the data from the stage tables in the SQL DB

    DUMMY_TABLE = "DUMMY_TABLE"
    spark.sql("DROP TABLE IF EXISTS " + DUMMY_TABLE)
    DF.write.saveAsTable(DUMMY_TABLE)  # write the stage data to a temporary table

    # now merge with the Delta table
    Merge_Query = """MERGE INTO delta_table.FACT_TABLE AS SQL
                     USING DUMMY_TABLE AS STAGE
                     ON STAGE.CODE = SQL.CODE
                     WHEN MATCHED THEN UPDATE ...
                     WHEN NOT MATCHED THEN INSERT ..."""
    

When two or more users trigger this notebook at the same time, it hits a concurrent update/append error at the line "DF.write.saveAsTable(DUMMY_TABLE)". How can we achieve a thread.sleep or a retry mechanism in the above code so that no user fails when running it at the same time?

Please advise.

  • What about catching the exception and running the query after a sleep? You could provide a wrapper as a Python function in the notebook to retry in case of a concurrency problem. – parisni Jun 04 '23 at 20:52
  • @parisni, thanks for your response. Do you have any pseudocode that could give me a better picture? – TalendDeveloper Jun 05 '23 at 05:26

1 Answer


I am not aware of a way to auto-retry in such an optimistic concurrency context.

What about catching the exception and retrying until it works? You could provide a run_until_success function in the notebook that retries until the query finishes:

    # ...

    Merge_Query = """MERGE INTO delta_table.FACT_TABLE AS SQL
                     USING DUMMY_TABLE AS STAGE
                     ON STAGE.CODE = SQL.CODE
                     WHEN MATCHED THEN UPDATE ...
                     WHEN NOT MATCHED THEN INSERT ..."""

    def run_until_success(spark, query):
        # keep retrying until the merge commits without a conflict
        flag = True
        while flag:
            try:
                spark.sql(query)
                flag = False
            except Exception:
                # you might also add a sleep here,
                # or catch the named exception raised by Delta
                pass

    run_until_success(spark, Merge_Query)
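
If the loop should not spin forever, a bounded variant with a growing pause between attempts could look like the sketch below. This is an illustration, not part of the original answer: run_with_retries, max_retries, and delay_seconds are made-up names, and the check for "Concurrent" in the error text relies on Delta's conflict exceptions (e.g. ConcurrentAppendException) carrying that word in their name and message.

    import time

    def run_with_retries(spark, query, max_retries=10, delay_seconds=5):
        # retry the query up to max_retries times, backing off linearly
        for attempt in range(1, max_retries + 1):
            try:
                spark.sql(query)
                return
            except Exception as e:
                # re-raise anything that is not a Delta concurrency conflict
                if "Concurrent" not in str(e):
                    raise
                # another writer won the optimistic-concurrency race;
                # wait a bit longer each time before retrying
                time.sleep(delay_seconds * attempt)
        raise RuntimeError(f"Merge still failing after {max_retries} attempts")

    run_with_retries(spark, Merge_Query)

The growing delay spreads concurrent writers apart, so repeated collisions on the same commit become less likely than with an immediate retry.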
parisni