SQL query equivalent in python to: delete only one identical record in a dataframe

Question

Problem-stmt: To Delete a column from a table[MDF] based on a condition[OP_DIRECTIVE = 'D'] from a different table[INC]. Both the tables have identical columns.

I am looking for a SQL Equivalent Query in Python to Delete one identical record even if there are multiple records match the DELETE condition

I've written the same in SQL [working]:- Approach to my solution: Copy unmatched rows to a work table, then truncate the original table and replace with contents of the work table. One way to identify unmatched rows would be to tag each of the input rows in a set of duplicates with a unique number, something like this:

INSERT work_table SELECT MI.col1, MI.col2, ...
FROM 
  (SELECT M.*,
   ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
   FROM MORTALITY M) MI
LEFT JOIN 
  (SELECT I.*, 
   ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
   FROM INC_MORTALITY I
   WHERE OP_DIRECTIVE='D') INC
ON MI.join_col1 = INC.join_col1
AND MI.join_col2 = INC.join_col2
... all the columns except for 'OP_DIRECTIVE'
AND MI.ROWNUM = INC.ROWNUM
WHERE INC.ROWNUM IS NULL /* "anti-join" keeps only unmatched rows */
;
DELETE FROM MORTALITY;
INSERT MORTALITY SELECT * FROM work_table;

What i've tried:

import os
import time
import pandas as pd

filePath = '/Users/test_files'
timestr = time.strftime("%Y-%m-%d-%-H%M%S")
fileName = ‘left_join' + timestr + '.txt'

if os.path.exists(filePath):
        MDF = pd.read_csv(“mdf.txt", sep='|')
        INC = pd.read_csv(“inc.txt”, sep='|')
        
       result = MDF.merge(
                    INC_D,
                    on=['data_source','dd_imp_flag','dob','dod','death_verification','gender_probability','gender','token_1','token_2','token_4','token_5','token_7','token_16','token_key'],
                    how = 'left',
                    suffixes=('', '_delme'))
        cols = result.columns.difference(MDF.columns)
        result = result.loc[result[cols].isnull().all(axis=1), MDF.columns.tolist()]

        result.to_csv(os.path.join(filePath, fileName), sep="|", index=False)  # remove header=None if header is needed
        print("Data export successful.")
else:
    print("File path does not exist.")

But this deletes all the records matching 'D' as an indicator, clearly i am missing ROW_NUMBER here, so i am wondering how to achieve it in python

MDF-previously

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

INC

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|D|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

MDF-after-updates-expected output

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

if you are looking to do a rownumber in pandas, then have a look at [groupby.cumcount](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.cumcount.html) — Ben.T, Jul 18 '21 at 17:57
you don't have the row `SSA|D|0|1940-12-01|1859-09-01|3|1.0|F|...` in your expected result, while it is both in MDF and INC, can you explain why you did not keep it? — Ben.T, Jul 19 '21 at 00:15
@Ben.T yes correct, because INC gives us all the rows that has to be deleted from MDF[main history file], so this record was common in both the df's[except for 'op_directive' indicator column] and it was a 'SINGLE' occurrence so it has to be deleted from the main file, Hope you got what i am trying to say? — Mando, Jul 19 '21 at 04:58
@Ben.T To be accurate, this is the scenario i wanted to achieve: Delete rows in the master table[MDF] which have a OP_DIRECTIVE = 'D' as the operation in the weekly update[INC]. For a given delete row, you should delete a single row in the master table which matches the delete record on all fields aside from the “D” operation column. Warning: please ensure you only delete, or mark as deleted, one record, even if more than one historical record fully matches this new delete record - [Note: these are the exact words from Business :)] — Mando, Jul 19 '21 at 05:01

score 1 · Accepted Answer · answered Jul 19 '21 at 13:02

1

With your explanation in the comments and the sentence "Warning: please ensure you only delete, or mark as deleted, one record, even if more than one historical record fully matches this new delete record", I believe a simple way to achieve your result is to use duplicated that would mark as True all duplicated rows, starting the second duplicates

result.loc[result[cols].isnull().all(axis=1)
           |result.duplicated(subset=MDF.columns, keep='first'), # add this condition
           MDF.columns.tolist()]

answered Jul 19 '21 at 13:02

Ben.T

29,160
6
32
54

1

okay, let me try that, BTW thanks for taking time to look into this :) – Mando Jul 19 '21 at 13:27
1

WOW!! it worked, this is exactly what i was looking for. Thanks Ben, appreciate it – Mando Jul 19 '21 at 13:33
@Mando basically, `duplicated` will consider `True` any duplicate rows, starting the second duplicated row. So in your case, the first row of each group that exists in INC will be False, and the rest True, in the case that there is a merge between MDF and INC. If there is no merge happening, then `result[cols].isnull(...` will keep this row. It is how I understand the warning from your business – Ben.T Jul 20 '21 at 01:53

SQL query equivalent in python to: delete only one identical record in a dataframe

1 Answers1