I have a use case that needs to update the same rows in a PostgreSQL table from four Spark jobs at the same time. The Spark jobs use 'executeUpdate()' from JDBC to update the same table. They are streaming jobs, which need to store data continuously.

I am thinking of relying on a row-based lock or a table-based lock to solve the deadlock issues. Are my assumptions below correct?

Row-based Lock

In each Spark job, I add 'BEGIN' and 'COMMIT' around the SQL statement run through 'executeUpdate()'. I also add a 'SELECT xxx FOR UPDATE' to explicitly lock the impacted rows in each Spark job. My understanding is that the next Spark job will automatically wait for the row lock to be released before executing its UPDATE statement on the rows impacted by the previous job.
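
For reference, here is a minimal JDBC sketch of what I mean by the row-based approach. The table name main_table, the key column id, and the updated column d are placeholders, not my real schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RowLockUpdate {

        // Lock the target row with SELECT ... FOR UPDATE, then update it in the
        // same transaction. Another job touching the same id blocks on its own
        // SELECT ... FOR UPDATE until this transaction commits or rolls back.
        public static void updateWithRowLock(Connection conn, long id, String newValue) throws Exception {
            conn.setAutoCommit(false);          // plays the role of BEGIN ... COMMIT
            try (PreparedStatement lock = conn.prepareStatement(
                     "SELECT id FROM main_table WHERE id = ? FOR UPDATE");
                 PreparedStatement update = conn.prepareStatement(
                     "UPDATE main_table SET d = ? WHERE id = ?")) {
                lock.setLong(1, id);
                try (ResultSet rs = lock.executeQuery()) {
                    // the row is now locked until COMMIT or ROLLBACK
                }
                update.setString(1, newValue);
                update.setLong(2, id);
                update.executeUpdate();
                conn.commit();                  // releases the row lock
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }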

Table-based Lock

In each Spark job, I add 'BEGIN', 'LOCK TABLE table', and 'COMMIT' around the SQL statement run through 'executeUpdate()'. My understanding is that the explicit 'LOCK' command locks the whole table for the duration of its transaction. When the second Spark job tries to lock the table, it will wait until the first Spark job releases the lock.
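
Again as a sketch with the same placeholder names; the lock mode here is my assumption (a bare 'LOCK TABLE' defaults to ACCESS EXCLUSIVE, which also blocks readers):

    import java.sql.Connection;
    import java.sql.Statement;

    public class TableLockUpdate {

        // Take an explicit table lock, run the update, and commit to release it.
        // A concurrent job issuing the same LOCK TABLE waits until the lock is free.
        public static void updateWithTableLock(Connection conn) throws Exception {
            conn.setAutoCommit(false);          // BEGIN
            try (Statement st = conn.createStatement()) {
                // SHARE ROW EXCLUSIVE serializes writers while still allowing plain reads
                st.execute("LOCK TABLE main_table IN SHARE ROW EXCLUSIVE MODE");
                st.executeUpdate("UPDATE main_table SET d = 'new-value' WHERE id = 42");
                conn.commit();                  // COMMIT releases the table lock
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }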

Would the above approaches work around the deadlock issues? Since the Spark jobs are streaming jobs, I prefer the row-based one, which should save waiting time. Are there better solutions? Thanks!

yyuankm
  • The first approach won't work, it will deadlock just as well. Locking the table will work, but it is bad for performance, and if you choose the wrong lock strength, it will break your database. If you tell us more about how these jobs work and how they identify the rows to change, a better solution can be found. – Laurenz Albe Mar 22 '21 at 04:59
  • Thanks. Locking the table does work, but I am worried about the performance. We have 4 Spark jobs updating the same table. Let's say this table has columns a, b, c, d, e. The first two jobs are batch jobs that run daily; both do an upsert from the staging table into this table: job1 upserts a, b and job2 upserts a, c. The other two jobs are streaming jobs that ingest data into a separate staging table and update the table every 2 minutes: job3 updates a, d and job4 updates a, c and f. Any comments for a better locking design are welcome! – yyuankm Mar 22 '21 at 09:05
  • The smaller the transactions, the less the risk of deadlock. If deadlocks happen rarely, they are no big problem. – Laurenz Albe Mar 22 '21 at 09:52
  • Great idea. I will try to break the big transaction into small ones and set a lock in each transaction. It should both reduce the risk of deadlock and improve the performance. Thanks! – yyuankm Mar 22 '21 at 10:26
  • If each transaction only updates one row, there shouldn't be any deadlocks. Do your jobs need to update multiple rows in the same transaction? – jjanes Mar 22 '21 at 20:48
  • In your description in the comment, how is the row identified? You only mention 5 columns, and list each one as getting updated. Is there not a primary key which doesn't get updated? How do the jobs know what they are supposed to be doing? Is there a coordinator that tells them? Are they fetching their todo list from the same database (same table?) as the one they are updating? – jjanes Mar 22 '21 at 20:51
  • Thanks jjanes. Each transaction of each Spark job updates several rows in the same table, which is why the deadlock may happen. The rows are identified by one primary key for all Spark jobs, but none of them update the primary key; they only update the other fields. The Spark jobs first ingest data from the data sources and store it in the corresponding staging table. Then they refer to the staging table to update the relevant rows in the main table shared by all Spark jobs. Hope this helps explain the scenario. Any suggestions are welcome! – yyuankm Mar 23 '21 at 22:31
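
Following up on the comments above, here is a sketch of the direction suggested there: one small transaction per updated row. The staging_table/main_table names and the d column are placeholders, and processing the staged rows in primary-key order is an extra, commonly used precaution to keep the lock acquisition order consistent across jobs; it goes beyond what the comments asked for.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class SmallTransactionUpdate {

        private record StagedRow(long id, String d) {}

        public static void applyStagedUpdates(Connection conn) throws Exception {
            // Read the staged changes first; ordering by the primary key keeps the
            // lock acquisition order the same in every job that uses this pattern.
            List<StagedRow> staged = new ArrayList<>();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id, d FROM staging_table ORDER BY id")) {
                while (rs.next()) {
                    staged.add(new StagedRow(rs.getLong("id"), rs.getString("d")));
                }
            }

            // Apply each change in its own short transaction so row locks are held
            // only for the duration of a single-row UPDATE.
            conn.setAutoCommit(false);
            try (PreparedStatement update = conn.prepareStatement(
                     "UPDATE main_table SET d = ? WHERE id = ?")) {
                for (StagedRow row : staged) {
                    update.setString(1, row.d());
                    update.setLong(2, row.id());
                    update.executeUpdate();
                    conn.commit();              // one row per transaction
                }
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }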

0 Answers