Remove one dataframe from another with Pandas

Question

I have two dataframes of different sizes (df1 and df2). I would like to remove from df1 all the rows that also appear in df2.

So if I have df2 equals to:

     A  B
0  wer  6
1  tyu  7

And df1 equals to:

     A  B  C
0  qwe  5  a
1  wer  6  s
2  wer  6  d
3  rty  9  f
4  tyu  7  g
5  tyu  7  h
6  tyu  7  j
7  iop  1  k

The final result should be like so:

     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k

I was able to achieve my goal with a for loop, but I would like to know if there is a more elegant and efficient way to perform this operation.

Here is the code I wrote, in case you need it:

import pandas as pd

df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B' : [    5,     6,     6,     9,     7,     7,     7,     1],
                    'C' : ['a'  ,   's',   'd',   'f',   'g',   'h',   'j',   'k']})

df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
                    'B' : [    6,     7]})

for i, row in df2.iterrows():
    # keep rows that differ from this df2 row in A or in B, so only full (A, B) matches are dropped
    df1 = df1[(df1['A']!=row['A']) | (df1['B']!=row['B'])].reset_index(drop=True)

jezrael · Accepted Answer · 2017-06-14T13:44:43.293

21

Use merge with an outer join and indicator=True, filter with query to keep only the rows that were not found in both dataframes, then remove the helper column with drop:

df = (pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
        .query("_merge != 'both'")
        .drop('_merge', axis=1)
        .reset_index(drop=True))
print(df)
     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
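
One caveat worth sketching: with how='outer', a row that exists only in df2 comes back as 'right_only' and therefore survives the `_merge != 'both'` filter. The variant below is a minimal sketch, not part of the original answer, that uses a left join and keeps only 'left_only' rows. The extra ('zzz', 0) row in df2 is a hypothetical addition made purely to show the difference.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})

# Hypothetical: ('zzz', 0) exists only in df2, not in df1.
df2 = pd.DataFrame({'A': ['wer', 'tyu', 'zzz'],
                    'B': [6, 7, 0]})

# Outer join as in the answer: the df2-only row is tagged 'right_only' and kept.
outer = (pd.merge(df1, df2, on=['A', 'B'], how='outer', indicator=True)
           .query("_merge != 'both'")
           .drop(columns='_merge')
           .reset_index(drop=True))

# Left join variant: only rows of df1 without a match in df2 are kept.
left = (pd.merge(df1, df2, on=['A', 'B'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge')
          .reset_index(drop=True))

print(outer)  # includes the 'zzz' row, with NaN in column C
print(left)   # only the qwe, rty and iop rows
```

For the dataframes in the question both versions give the same result, because every row of df2 has a match in df1.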

edited Jun 14 '17 at 13:44

answered Jun 14 '17 at 13:29

jezrael


Is it possible to specify the columns names 'A' and 'B'? – Federico Gentile Jun 14 '17 at 13:44

Yes, sure, add parameter `on` – jezrael Jun 14 '17 at 13:44

Thanks a lot, friend! – Federico Gentile Jun 14 '17 at 13:46

score 12 · Answer 2 · answered Nov 29 '17 at 15:24

12

The cleanest way I found was to use drop from pandas using the index of the dataframe you want to drop:

df1.drop(df2.index, axis=0, inplace=True)

answered Nov 29 '17 at 15:24

Elliot Ben



I believe that this does not answer the question. It assumes that identical rows will have the same index. However, in the example posted in the question this is not the case. As a result, you will remove the rows with index 0 and 1 from df1. – Mewtwo Mar 31 '19 at 21:13
Genius answer, thanks! One can even extend it to a single column, such as: df.spec_col.drop(drop.index, axis=0) – brygid Aug 06 '21 at 19:59
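
To illustrate Mewtwo's comment, here is a minimal sketch using the dataframes from the question. It uses the non-inplace form of drop so that df1 stays intact for comparison. Because df2 has the default index labels 0 and 1, the call removes the rows of df1 labelled 0 and 1, whatever their contents.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})

df2 = pd.DataFrame({'A': ['wer', 'tyu'],
                    'B': [6, 7]})

# Drops index labels 0 and 1 of df1, i.e. ('qwe', 5, 'a') and ('wer', 6, 's').
by_index = df1.drop(df2.index, axis=0)
print(by_index)
# 'qwe' is gone although it is not in df2, and most 'wer'/'tyu' rows remain.
```

Dropping by index only matches the question when df2 was built as a slice of df1 and still carries df1's index labels.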

Allen Qin · Answer 3 · 2017-06-14T13:42:51.957

3

You can use np.in1d to check, for each row of df1, whether all of its 'A' and 'B' values appear in df2, and then use the negated result as a mask to select rows from df1.

df1[~df1[['A','B']].apply(lambda x: np.in1d(x,df2).all(),axis=1)]\
                   .reset_index(drop=True)
Out[115]: 
     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
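
Note that this check is value-wise: it asks whether each value of a row occurs anywhere in df2, not whether the (A, B) pair occurs as a row. The sketch below is an illustration under assumptions, not part of the original answer. It uses np.isin (which recent NumPy releases recommend over np.in1d) and a hypothetical df1 that contains the row ('wer', 7), a pair that is not in df2.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ('wer', 7) is not a row of df2, but 'wer' and 7 both occur in it.
df1 = pd.DataFrame({'A': ['qwe', 'wer', 'tyu'],
                    'B': [5, 7, 7],
                    'C': ['a', 's', 'g']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'],
                    'B': [6, 7]})

# Value-wise test: is every value of the row found somewhere in df2?
pool = df2.to_numpy().ravel()
value_wise = df1[['A', 'B']].apply(
    lambda row: np.isin(row.to_numpy(), pool).all(), axis=1)

# Row-wise test: is the (A, B) pair itself a row of df2?
pairs = set(df2[['A', 'B']].itertuples(index=False, name=None))
row_wise = [pair in pairs
            for pair in df1[['A', 'B']].itertuples(index=False, name=None)]

print(value_wise.tolist())  # [False, True, True]
print(row_wise)             # [False, False, True]
```

With the value-wise mask the ('wer', 7) row would be removed from df1 even though df2 does not contain it. On the data from the question the two tests agree.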

edited Jun 14 '17 at 13:42

answered Jun 14 '17 at 13:30

Allen Qin


asongtoruin · Answer 4 · 2017-06-14T13:41:01.127

pandas has a method called isin; however, this relies on unique indices. We can define a lambda function that builds a single key column from the existing 'A' and 'B' columns of df1 and df2, and pass those keys to isin. We then negate the result (as we want the rows not in df2) and reset the index:

import pandas as pd

df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B' : [    5,     6,     6,     9,     7,     7,     7,     1],
                    'C' : ['a'  ,   's',   'd',   'f',   'g',   'h',   'j',   'k']})

df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
                    'B' : [    6,     7]})

unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print(df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True))

This prints:

     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
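
The same key idea can be sketched without casting to strings, by comparing (A, B) tuples through a MultiIndex. This is a possible alternative, not the author's code. The helper name rows_not_in and its keys parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd

def rows_not_in(left, right, keys):
    """Hypothetical helper: keep the rows of `left` whose `keys` values
    do not appear together as a row of `right`."""
    left_keys = pd.MultiIndex.from_frame(left[keys])
    right_keys = pd.MultiIndex.from_frame(right[keys])
    # isin on a MultiIndex compares whole tuples, not individual values.
    return left[~left_keys.isin(right_keys)].reset_index(drop=True)

df1 = pd.DataFrame({'A': ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B': [5, 6, 6, 9, 7, 7, 7, 1],
                    'C': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A': ['wer', 'tyu'],
                    'B': [6, 7]})

result = rows_not_in(df1, df2, ['A', 'B'])
print(result)
```

Working with tuples avoids having to pick a separator such as '_' that might itself occur in the data.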

score 1 · Answer 5 · answered Sep 22 '20 at 06:12

1

I think the cleanest way is the following.

Suppose we have a base dataframe D and want to remove a subset D1 from it, matching rows by index. Let the output be D2:

D2 = pd.DataFrame(D, index = set(D.index).difference(set(D1.index))).reset_index()

answered Sep 22 '20 at 06:12

Sameer Saurabh


score 0 · Answer 6 · edited Dec 26 '21 at 17:26

0

I find this other alternative useful too:

pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)

     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k

keep=False drops both duplicates.

It doesn't require the two dataframes to have the same set of columns, so I find it a bit easier.
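
One thing to keep in mind, sketched below on hypothetical data: keep=False treats every repeated (A, B) pair the same way, so rows duplicated within df1 itself are dropped even if df2 does not contain them, and rows that exist only in df2 end up in the result.

```python
import pandas as pd

# Hypothetical data: ('qwe', 5) appears twice in df1 and never in df2;
# ('zzz', 0) appears only in df2.
df1 = pd.DataFrame({'A': ['qwe', 'qwe', 'wer'],
                    'B': [5, 5, 6],
                    'C': ['a', 'b', 's']})
df2 = pd.DataFrame({'A': ['wer', 'zzz'],
                    'B': [6, 0]})

result = (pd.concat([df1, df2], axis=0, ignore_index=True)
            .drop_duplicates(subset=['A', 'B'], keep=False, ignore_index=True))
print(result)
# Only the ('zzz', 0) row remains, with NaN in column C:
# both 'qwe' rows were dropped as duplicates of each other.
```

On the data from the question this does not matter, since the rows of df1 that are kept have unique (A, B) pairs and every row of df2 has a match in df1.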

edited Dec 26 '21 at 17:26

blackbishop


answered Dec 26 '21 at 14:21

Ian Mancilla


score 0 · Answer 7 · answered Apr 21 '23 at 15:43

I used this version to drop all the rows whose index matches between df1 and df2, but I was getting errors because some index labels could not be found in df1. I set errors='ignore' and it worked perfectly. Thanks:

df1.drop(df2.index, axis=0, inplace=True, errors='ignore')
