Remove rows from dataframe df1 if their columnS valueS exist in other dataframe df2

Question

I tried this :

res = df1[~(getattr(df1, 'A').isin(getattr(df2, 'A')) & getattr(df1, 'C').isin(getattr(df2, 'C')))]

It works BUT the list of columns is variable in this example columns = ['A', 'C'] how can I loop over it to get the above expression dynamically according to the values of the list 'columns'

exp: df1:

       A      B  C   D
0     oo    one  0   0
1    bar   one1  1   2
2    foo   two2  2   4
3    bar   one1  3   6
4    foo    two  4   8
5    bar    two  5  10
6    foo    one  6  12
7  fowwo  three  7  14

df2:

       A      B  C   D
0     oo    one  0   0
2    foo   two2  2   4
3    bar   one1  3   6
4    foo    two  4   8
5    bar    two  5  10
6    foo    one  6  12
7  fowwo  three  7  14

res:

     A     B  C  D
1  bar  one1  1  2

Can you explain how'd you got the output `bar one1`, it also seems to be present in both the data frames? — Shubham Sharma, Jun 23 '20 at 12:37
yes that is correct just I changed the list of columns from ['A','B'] to ['A', 'C'] thanks for mentioning that — asmatrk, Jun 23 '20 at 13:37

DavideBrex · Answer 1 · 2020-06-24T11:50:08.827

1

Use:

column_list = ["A","C"]
df1[(~pd.concat((getattr(df1, col).isin(getattr(df2, col)) for col in column_list), axis=1 )).any(1)]

Output:

    A   B       C   D
1   bar one1    1   2

EDIT

The new situation you explained in the comments can be solved with merge.

Dataframes:

df3= pd.DataFrame({'A': '1010994595 1017165396 1020896102 1028915753 1028915753 1030811227 1033837508 1047224448 1047559040 1053827106 1094815936 1113339076 1115345471 1121416375 1122392586 1122981502 1132224809 '.split(), 'B': '99203 99232 99233 99231 99291 99291 99232 99232 99242 99232 99244 G0425 99213 99203 99606 99243 99214'.split(), 'C': np.arange(17), 'D': np.arange(17) * 2})
df4= pd.DataFrame({'A': '1115345471 1113339076 1020896102 1047224448 1053827106 1121416375 1122392586 1028915753 1132224809 1030811227 1094815936 1033837508 1047559040 1122981502 1028915753 1030811227 1017165396 '.split(), 'B': '99213 G0425 99291 99232 99291 99243 99606 99291 99214 99291 99244 99233 99242 99243 99291 99291 99232 '.split(), 'C': np.arange(17), 'D': np.arange(17) * 2})

Code to select rows from df4 that are not in df3 (for columns in column_list):

list_col = ["A","B"]
df4[df4.merge(df3.drop_duplicates(), on=list_col, how='left', indicator=True)["_merge"] == "left_only"]

Output:

    A           B       C   D
2   1020896102  99291   2   4
4   1053827106  99291   4   8
5   1121416375  99243   5   10
11  1033837508  99233   11  22

If you want to reset the index for the new table add .reset_index(drop=True) at the end

edited Jun 24 '20 at 11:50

answered Jun 23 '20 at 12:44

DavideBrex

2,374
1
10
23

I need the comparison between df1 and df2 to be build on a specific columns not all of them it wan't give the same result ! – asmatrk Jun 23 '20 at 13:11
import pandas as pd import numpy as np df3= pd.DataFrame({'A': '1010994595 1017165396 1020896102 1028915753 1028915753 1030811227 1033837508 1047224448 1047559040 1053827106 1094815936 1113339076 1115345471 1121416375 1122392586 1122981502 1132224809 '.split(), 'B': '99203 99232 99233 99231 99291 99291 99232 99232 99242 99232 99244 G0425 99213 99203 99606 99243 99214'.split(), 'C': np.arange(17), 'D': np.arange(17) * 2}) – asmatrk Jun 24 '20 at 10:37
df4= pd.DataFrame({'A': '1115345471 1113339076 1020896102 1047224448 1053827106 1121416375 1122392586 1028915753 1132224809 1030811227 1094815936 1033837508 1047559040 1122981502 1028915753 1030811227 1017165396 '.split(), 'B': '99213 G0425 99291 99232 99291 99243 99606 99291 99214 99291 99244 99233 99242 99243 99291 99291 99232 '.split(), 'C': np.arange(17), 'D': np.arange(17) * 2}) – asmatrk Jun 24 '20 at 10:39
df4[(~pd.concat((getattr(df4, col).isin(getattr(df3, col)) for col in ['A', 'B']), axis=1 )).any(1)] – asmatrk Jun 24 '20 at 10:39
for this example it doesn't give correct output (sorry for format I can't reformat it to code in this comment ) – asmatrk Jun 24 '20 at 10:40
the output is an empty dataframe however the rows in df4 where column 'A' = 1053827106 , 1121416375 , 1033837508 , 1020896102 they have values of column 'B' different of what we have on df3 so I expect to get this rows as a result – asmatrk Jun 24 '20 at 10:48
See the updated answer and I hope I could help you with this problem – DavideBrex Jun 24 '20 at 11:50
unfortunately I get this warning when I used your code on my project : UserWarning: Boolean Series key will be reindexed to match DataFrame index. BUT it works fine for the example out of the code ready I didn't understand why by finally I solve it by my own ^^ thanks for your help and time – asmatrk Jun 24 '20 at 18:17

score 0 · Accepted Answer · answered Jun 24 '20 at 18:22

0

The answer is :

columns = ['A', 'B']
common_data_between_df1_and_df2_relative_to_columns = df1.merge(df2, on=columns , right_index=True)
res = df1[~(df1.index.isin(common_data_between_df1_and_df2 .index))].dropna()

answered Jun 24 '20 at 18:22

asmatrk

237
2
9

Remove rows from dataframe df1 if their columnS valueS exist in other dataframe df2

2 Answers2