8

I have two dataframes of different size (df1 nad df2). I would like to remove from df1 all the rows which are stored within df2.

So if I have df2 equals to:

     A  B
0  wer  6
1  tyu  7

And df1 equals to:

     A  B  C
0  qwe  5  a
1  wer  6  s
2  wer  6  d
3  rty  9  f
4  tyu  7  g
5  tyu  7  h
6  tyu  7  j
7  iop  1  k

The final result should be like so:

     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k

I was able to achieve my goal by using a for loop but I would like to know if there is a better and more elegant and efficient way to perform such operation.

Here is the code I wrote in case you need it: import pandas as pd

df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B' : [    5,     6,     6,     9,     7,     7,     7,     1],
                    'C' : ['a'  ,   's',   'd',   'f',   'g',   'h',   'j',   'k']})

df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
                    'B' : [    6,     7]})

for i, row in df2.iterrows():
    df1 = df1[(df1['A']!=row['A']) & (df1['B']!=row['B'])].reset_index(drop=True)
Federico Gentile
  • 5,650
  • 10
  • 47
  • 102

7 Answers7

21

Use merge with outer join with filter by query, last remove helper column by drop:

df = pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
       .query("_merge != 'both'")
       .drop('_merge', axis=1)
       .reset_index(drop=True)
print (df)
     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
12

The cleanest way I found was to use drop from pandas using the index of the dataframe you want to drop:

df1.drop(df2.index, axis=0,inplace=True)
Elliot Ben
  • 129
  • 1
  • 6
  • 8
    I believe that this does not answer the question. It assumes that identical row will have the same index. However in the example posted in the question this is not the case. As a result you will remove rows with index 0 and 1 from df1. – Mewtwo Mar 31 '19 at 21:13
  • genius answer thanks! one can even extend it to column names such as : df.spec_col.drop(drop.index,axis = 0) – brygid Aug 06 '21 at 19:59
3

You can use np.in1d to check if any row in df1 exists in df2. And then use it as a reversed mask to select rows from df1.

df1[~df1[['A','B']].apply(lambda x: np.in1d(x,df2).all(),axis=1)]\
                   .reset_index(drop=True)
Out[115]: 
     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
2

pandas has a method called isin, however this relies on unique indices. We can define a lambda function to create columns we can use in this from the existing 'A' and 'B' of df1 and df2. We then negate this (as we want the values not in df2) and reset the index:

import pandas as pd

df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
                    'B' : [    5,     6,     6,     9,     7,     7,     7,     1],
                    'C' : ['a'  ,   's',   'd',   'f',   'g',   'h',   'j',   'k']})

df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
                    'B' : [    6,     7]})

unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True)

printing:

     A  B  C
0  qwe  5  a
1  rty  9  f
2  iop  1  k
asongtoruin
  • 9,794
  • 3
  • 36
  • 47
1

I think the cleanest way can be:

We have base dataframe D and want to remove a subset D1. Let the output be D2

D2 = pd.DataFrame(D, index = set(D.index).difference(set(D1.index))).reset_index()
0

I find this other alternative useful too:

pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)


         A   B  C
    0   qwe  5  a
    1   rty  9  f
    2   iop  1  k

keep=False drops both duplicates.

It doesn't require to put all the equal columns between the two df, so I find that a bit easier.

blackbishop
  • 30,945
  • 11
  • 55
  • 76
0

used this version to erase all the rows that have a matching index between df1 and df2 but I was getting errors because it could not find particular indexes, I turned off the errors and it worked perfectly. Thanks:

df1.drop(df2.index, axis=0, inplace=True, errors = 'ignore')