
I get a view with IDs, and I have to delete the corresponding records from a table in a database.

View:

| columnName |
|------------|
| 1          |
| 2          |
| 3          |
| 4          |

Variables:

tableName = name of the table
columnName = the column to match on
database = name of the database

Delete logic 1

spark.sql("""DELETE FROM """+database+"""."""+tableName+""" WHERE """+columnName+""" IN ( """+st+""") """)

Here, st holds all the unique values; e.g., if columnName is empid, then st will have 1,2,3, etc., which I extract from the view using Python.
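
For reference, a minimal sketch of how `st` can be built, assuming the IDs sit in a temp view (the name `ids_view` is a placeholder, not part of the original setup):

    # Collect the distinct IDs from the view and build the comma-separated IN-list
    ids = [str(r[0]) for r in spark.sql(f"SELECT DISTINCT {columnName} FROM ids_view").collect()]
    st = ",".join(ids)  # numeric IDs assumed; string IDs would need quoting
    # ... then run Delete logic 1 above with this st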

2nd approach

spark.sql("""MERGE INTO database.tablename ... USING VIEW..... WHEN MATCHED THEN DELETE""")

Both approaches work for me. Logic 1 is much faster than the merge logic, but it still takes 20+ minutes because my tables have a lot of rows. Is there a way to reduce the time taken by these delete operations?

EDIT: Full use case. I'll use an example to explain it.

TableName: Employee Details
Database Name: emp_db

| EmpID | InsertionDate | Field1 | Field2 | Field3 |
|-------|---------------|--------|--------|--------|
| 1     | 10-10-2014    | 01X    | 2YXM   | ABC    |
| 1     | 10-10-2014    | 01Y    | 3YXM   | XYZ    |
| 1     | 10-10-2014    | 01E    | 4YXM   | ABX    |
| 1     | 10-10-2014    | 01A    | 5YXM   | DEF    |

Now, I get two views: one with the full details for the employees and another with just the list of employees that have changed.

View: global_temp.EmployeeDetails

| EmpID | InsertionDate | Field1 | Field2 | Field3 |
|-------|---------------|--------|--------|--------|
| 1     | 10-12-2014    | 01Y    | 2YXM   | ABC    |
| 1     | 10-12-2014    | 01P    | 123X   | ABY    |
| 1     | 10-12-2014    | 01R    | ABY    | LMN    |


2nd View: global_temp.IDS

| EmpID |
|-------|
| 1     |

Here, the IDS view has only one record because, in our example, only one ID has updated records.

Now, I need to replace all the records for EmpID 1 in the main table with those from the view. To achieve this, I first delete the entries for ID 1 from my main table and then insert all the entries from the global_temp.EmployeeDetails view.
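
A minimal sketch of that delete-then-insert flow, using the example names above (the physical table name `emp_db.EmployeeDetails`, written without the space, is an assumption):

    # Delete the employees that have updates, reusing the IN-list approach from above
    ids = [str(r["EmpID"]) for r in spark.table("global_temp.IDS").collect()]
    spark.sql(f"DELETE FROM emp_db.EmployeeDetails WHERE EmpID IN ({','.join(ids)})")

    # Re-insert their refreshed rows from the details view
    spark.sql("INSERT INTO emp_db.EmployeeDetails SELECT * FROM global_temp.EmployeeDetails")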

There is no limit on how many employee IDs get updated, so there can be days with only one change and days with 100k changes. That means I cannot tell in advance what percentage of the table is being deleted.

  • Maybe you could `SELECT` and overwrite the existing table? – ZygD Oct 18 '22 at 07:59
  • In SQL best practices, when you have to delete more than 10% of your table, it is more efficient to overwrite – Steven Oct 18 '22 at 08:15
  • Is this column maybe a partition column? – partlov Oct 18 '22 at 08:55
  • I cannot directly overwrite the table because there can be cases where EmpID 1 is updated and EmpID 2 is not. In this case, the global view will only have details for EmpID 1 and if I overwrite my table, I will lose the details for Emp ID 2. @ZygD – Martin Oct 18 '22 at 10:12
  • @Steven There is no way for me to tell what percent of my table is being deleted. There can be days where only 0.002 % is deleted and there can be days where 10% or above is being deleted – Martin Oct 18 '22 at 10:12
  • @partlov, I don't follow what you're trying to say, sorry. – Martin Oct 18 '22 at 10:13
  • I have made changes to the question to properly explain my use case. – Martin Oct 18 '22 at 10:14
  • What about the union of global_temp.EmployeeDetails with [select * from original table where empID not in global_temp.IDS]? – ZygD Oct 18 '22 at 10:36 (sketched after these comments)
  • @ZygD, I don't think that would be a good approach. It would easily give me all the rows I need after the deletion, but it would increase the time taken at other points in the code; for example, I would then have to truncate the table first and do two merge operations on the main table, one with this data and another with the new data coming from the view. – Martin Oct 19 '22 at 11:25
  • You can use the Z-order technique (https://docs.databricks.com/delta/data-skipping.html) to collocate related data in a single file. The next time you overwrite/delete, it may reduce your job execution time. – Muhammad Umar Amanat Nov 25 '22 at 09:02 (sketched after these comments)
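
A rough sketch of the union-and-overwrite idea from ZygD's comment, assuming emp_db.EmployeeDetails is a Delta table (Delta reads the pre-write snapshot, so the self-referencing INSERT OVERWRITE is allowed; plain file sources would reject it):

    spark.sql("""
        INSERT OVERWRITE TABLE emp_db.EmployeeDetails
        SELECT * FROM emp_db.EmployeeDetails t
        WHERE NOT EXISTS (SELECT 1 FROM global_temp.IDS i WHERE i.EmpID = t.EmpID)
        UNION ALL
        SELECT * FROM global_temp.EmployeeDetails
    """)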
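And the Z-order suggestion from the last comment boils down to a single maintenance command (Databricks OPTIMIZE ... ZORDER BY); it only clusters the files by EmpID so that later deletes can skip most of them, it does not delete anything itself:

    spark.sql("OPTIMIZE emp_db.EmployeeDetails ZORDER BY (EmpID)")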

1 Answer


If EmpID doesn't have huge cardinality, you are able to partition your data by it, and you are able to use Delta Lake 2.0 or newer, you can rely on Delta Lake dynamic partition overwrites. This is probably the fastest solution you can use.
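
A hedged sketch of what that could look like, assuming emp_db.EmployeeDetails is a Delta table partitioned by EmpID and Delta Lake 2.0+ is in use; only the partitions present in the incoming view are rewritten:

    (spark.table("global_temp.EmployeeDetails")
        .write
        .format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")  # replace only the touched partitions
        .saveAsTable("emp_db.EmployeeDetails"))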

partlov
  • I have 506k unique employee IDs. I don't think it would be a good idea to partition by EmpID. What would the drawbacks be if I decide to partition by it anyway? – Martin Oct 19 '22 at 04:33
  • It would not be a good idea to partition by that column in that case. You should have much lower cardinality, and also at least 1 GB of data per partition; these are common best practices. The drawback is that when the cardinality is that high and the data per partition that small, it doesn't make sense and you lose all the advantages of partitioning. – partlov Oct 19 '22 at 10:23