I get a view of IDs, and I have to delete the corresponding records from a table in a database.
View:
| columnName |
|------------|
| 1 |
| 2 |
| 3 |
| 4 |
Variables:
- `tableName` = name of the table
- `columnName` = column to match on
- `database` = name of the database
Delete logic 1:

```python
spark.sql("DELETE FROM " + database + "." + tableName + " WHERE " + columnName + " IN (" + st + ")")
```
Here, `st` holds all the unique values; for example, if `columnName` is `empid`, then `st` will contain `1,2,3`, etc., which I extract from the view using Python.
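For illustration, a minimal sketch of how `st` and the DELETE statement could be built (the helper name `build_delete_sql` and the table/column names are hypothetical, not from the original post; the actual `spark.sql(...)` call is left commented out since it needs a live session):

```python
# Hypothetical helper: builds the DELETE statement from the view's ID values.
def build_delete_sql(database, table_name, column_name, ids):
    # Join the unique IDs into a comma-separated list for the IN clause.
    st = ",".join(str(i) for i in sorted(set(ids)))
    return f"DELETE FROM {database}.{table_name} WHERE {column_name} IN ({st})"

# ids would come from the view, e.g.:
# ids = [r[0] for r in spark.sql("SELECT columnName FROM the_view").collect()]
sql = build_delete_sql("emp_db", "employee_details", "empid", [1, 2, 3, 4])
# spark.sql(sql)  # executes the delete on the target table
```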
2nd approach:

```python
spark.sql("MERGE INTO database.tablename ... USING VIEW ... WHEN MATCHED THEN DELETE")
```
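Spelled out in full, the MERGE-delete might look like the following. This is a sketch under assumptions: the target table name `emp_db.employee_details` and the ID view `global_temp.IDS` are taken from the example later in the post, and the join column is assumed to be `EmpID`.

```python
# Hypothetical fully spelled-out MERGE-delete: every target row whose EmpID
# appears in the ID view is deleted.
merge_sql = """
MERGE INTO emp_db.employee_details AS t
USING global_temp.IDS AS s
ON t.EmpID = s.EmpID
WHEN MATCHED THEN DELETE
""".strip()
# spark.sql(merge_sql)  # requires a live Spark session and a table format
#                       # that supports MERGE (e.g. Delta Lake)
```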
Both approaches work for me. Logic 1 is much faster than the merge logic, but it still takes 20+ minutes because my tables have a lot of rows. Is there a way to improve this and reduce the time taken by the delete activities?
EDIT: Full use case. I'll use an example to explain it.
TableName: Employee Details
Database Name: emp_db
| EmpID | InsertionDate | Field1 | Field2 | Field3 |
|-------|---------------|--------|--------|--------|
| 1 | 10-10-2014 | 01X | 2YXM | ABC |
| 1 | 10-10-2014 | 01Y | 3YXM | XYZ |
| 1 | 10-10-2014 | 01E | 4YXM | ABX |
| 1 | 10-10-2014 | 01A | 5YXM | DEF |
Now, I will get 2 views: one with all the details of the employees, and the other with just the list of employees whose records have changed.
View: global_temp.EmployeeDetails
| EmpID | InsertionDate | Field1 | Field2 | Field3 |
|-------|---------------|--------|--------|--------|
| 1 | 10-12-2014 | 01Y | 2YXM | ABC |
| 1 | 10-12-2014 | 01P | 123X | ABY |
| 1 | 10-12-2014 | 01R | ABY | LMN |
2nd View: global_temp.IDS
| EmpID |
|-------|
| 1 |
Here, the IDS view has only one record because, in our example, only one ID has updated records.
Now, I need to replace all the records for EmpID 1 in the main table from the view. To achieve this I first delete the entries for ID 1 from my main table and then insert all the entries from the global_temp.employeedetails view.
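The two-step replace described above could be sketched as follows. This is only an illustration: the physical table name `emp_db.employee_details` is my assumption for "Employee Details", and the IN-subquery form of DELETE assumes a table format that supports subqueries in the WHERE clause (newer Delta Lake versions do).

```python
# Assumed physical name for the "Employee Details" table in emp_db.
target = "emp_db.employee_details"

# Step 1: delete every row whose EmpID appears in the IDS view.
delete_sql = f"DELETE FROM {target} WHERE EmpID IN (SELECT EmpID FROM global_temp.IDS)"

# Step 2: insert the fresh rows for those employees from the details view.
insert_sql = f"INSERT INTO {target} SELECT * FROM global_temp.EmployeeDetails"

# spark.sql(delete_sql)  # requires a live Spark session
# spark.sql(insert_sql)
```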
There is no limit on how many employee IDs get an update, so there can be days with only 1 change and days with 100k changes. So I cannot tell in advance what percentage of the table is being deleted.