
I have two datasets:

df1 = pd.DataFrame(data = {'label1': ['A', 'A', 'B', 'C'], 'label2': ['a', 'b', 'c', 'd'], 'value': [1,2,3,4]})

df2 = pd.DataFrame(data = {'label1': ['A', 'A', 'D', 'E'], 'label2': ['a', 'd', 'c', 'e'], 'value2': [10,12,23,14]})

I would like to perform an anti-join so that the resulting data frame contains the rows of df1 where the key [['label1', 'label2']] is not found in df2.

The resulting df should be:

label1     label2     value
A          b          2
B          c          3
C          d          4

In R using dplyr, the code would be:

df3 = anti_join(df1, df2, by = c("label1", "label2"))

Thanks for your help.

user3483203
Olivia Watkins

2 Answers


Using isin with tuples

df1[~df1[['label1','label2']].apply(tuple, axis=1).isin(df2[['label1','label2']].apply(tuple, axis=1))]
Out[140]: 
  label1 label2  value
1      A      b      2
2      B      c      3
3      C      d      4
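For reference, a self-contained version of this approach (rebuilding the question's two frames, with the `'label'2'` typo corrected to `'label2'`):

```python
import pandas as pd

df1 = pd.DataFrame({'label1': ['A', 'A', 'B', 'C'],
                    'label2': ['a', 'b', 'c', 'd'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'label1': ['A', 'A', 'D', 'E'],
                    'label2': ['a', 'd', 'c', 'e'],
                    'value2': [10, 12, 23, 14]})

# Collapse each (label1, label2) key pair into a tuple, then keep the
# df1 rows whose key tuple does not appear among df2's key tuples.
keys1 = df1[['label1', 'label2']].apply(tuple, axis=1)
keys2 = df2[['label1', 'label2']].apply(tuple, axis=1)
result = df1[~keys1.isin(keys2)]
```

This keeps the original index labels of df1 (1, 2, 3), matching the output shown above.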
BENY

Option 1:

Simply perform an inner join and remove the intersecting rows from df1.

df1.drop(df1.merge(df2).index)
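A runnable sketch of this option. Note the caveat raised in the comments below: merge resets the index, so this only works when df1 itself has a default RangeIndex that lines up with the merge result's index.

```python
import pandas as pd

df1 = pd.DataFrame({'label1': ['A', 'A', 'B', 'C'],
                    'label2': ['a', 'b', 'c', 'd'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'label1': ['A', 'A', 'D', 'E'],
                    'label2': ['a', 'd', 'c', 'e'],
                    'value2': [10, 12, 23, 14]})

# Inner merge on the shared columns finds the intersecting rows;
# dropping their index labels from df1 leaves the anti-join.
# Caveat: merge returns a fresh RangeIndex, so the labels it produces
# only correspond to df1's rows if df1 also has a default RangeIndex.
common = df1.merge(df2)
result = df1.drop(common.index)
```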

Option 2:

Do a left join and keep the rows where the column coming from df2 is null, since those are the rows with no match.

mer = df1.merge(df2, how='left')
mer[mer['value2'].isnull()].drop(columns=['value2'])

Output:

  label1 label2  value
1      A      b      2
2      B      c      3
3      C      d      4
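A self-contained version of Option 2 (frames rebuilt from the question, with the column-name typo corrected):

```python
import pandas as pd

df1 = pd.DataFrame({'label1': ['A', 'A', 'B', 'C'],
                    'label2': ['a', 'b', 'c', 'd'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'label1': ['A', 'A', 'D', 'E'],
                    'label2': ['a', 'd', 'c', 'e'],
                    'value2': [10, 12, 23, 14]})

# Left join: df1 rows without a key match in df2 get NaN in 'value2'.
mer = df1.merge(df2, on=['label1', 'label2'], how='left')
# Keep only the unmatched rows and drop the helper column.
result = mer[mer['value2'].isnull()].drop(columns=['value2'])
```

Unlike Option 1, this does not depend on df1's index, since the null test is done on the merged frame itself.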
harvpan
    Option 1 is great. - Thanks. – Olivia Watkins Jul 06 '18 at 19:06
  • Sorry, but when I apply the option 1 code to my larger dataset, it does not work. I think the problem is that the merge creates a new index. I end up with a dataset the correct size, but with the wrong rows. I think the example here works because the only row to keep is the first. – Olivia Watkins Jul 06 '18 at 19:40
  • @OliviaWatkins, do you get any error? In this case, option 2 would work. – harvpan Jul 06 '18 at 19:50