
One situation I sometimes encounter is this: I have two dataframes (`df1`, `df2`) and I want to create a new dataframe (`df3`) based on the intersection of multiple columns between `df1` and `df2`.

For example, I want to create df3 by filtering df1 by columns Campaign and Group.

import pandas as pd
df1 = pd.DataFrame({'Campaign':['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'], 'Group':['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'], 'Metric':[245,91,292,373,32]}, columns=['Campaign', 'Group', 'Metric'])
df2 = pd.DataFrame({'Campaign':['Campaign 3', 'Campaign 3'], 'Group':['Group 1', 'Group 2'], 'Metric':[23, 456]}, columns=['Campaign', 'Group', 'Metric'])

df1

     Campaign            Group  Metric
0  Campaign 1       Some group     245
1  Campaign 2  Arbitrary Group      91
2  Campaign 3          Group 1     292
3  Campaign 3          Group 2     373
4  Campaign 4       Done Group      32

df2

     Campaign    Group  Metric
0  Campaign 3  Group 1      23
1  Campaign 3  Group 2     456

I know I can do this with merge...

df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('','_del'))
#df3
     Campaign    Group  Metric  Metric_del
0  Campaign 3  Group 1     292          23
1  Campaign 3  Group 2     373         456

but then I have to figure out how to drop the columns that end with `_del`. I guess something like this (`DataFrame.select` has since been removed from pandas, and the original version also needed `import re`, so dropping by column name is cleaner):

df3.drop(columns=[c for c in df3.columns if c.endswith('_del')])
##The result I'm going for, but it required merge, then drop (2 steps)
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
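Putting the two steps together as a self-contained, runnable sketch (using `drop` for the second step, since `DataFrame.select` has since been removed from pandas):

```python
import pandas as pd

df1 = pd.DataFrame({'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
                    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
                    'Metric': [245, 91, 292, 373, 32]})
df2 = pd.DataFrame({'Campaign': ['Campaign 3', 'Campaign 3'],
                    'Group': ['Group 1', 'Group 2'],
                    'Metric': [23, 456]})

# Step 1: inner merge on the two key columns; df2's clashing Metric column
# comes back with the '_del' suffix
df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('', '_del'))

# Step 2: drop every column that picked up the '_del' suffix
df3 = df3.drop(columns=[c for c in df3.columns if c.endswith('_del')])
```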

Questions

What I'm mainly interested in is returning df1 that's simply filtered on df2's Campaign|Group values.

  1. Is there a better way to return df1 without resorting to merge?

  2. Is there a way to merge but NOT bring df2's columns into the result, returning only df1's columns?
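For what it's worth, one possible sketch of question 2 (my own illustration, an assumption rather than a confirmed approach) is to merge against only df2's key columns, so there is nothing to suffix or drop:

```python
import pandas as pd

df1 = pd.DataFrame({'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
                    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
                    'Metric': [245, 91, 292, 373, 32]})
df2 = pd.DataFrame({'Campaign': ['Campaign 3', 'Campaign 3'],
                    'Group': ['Group 1', 'Group 2'],
                    'Metric': [23, 456]})

# Merge against only the key columns of df2: no other columns clash,
# so no suffixes appear. drop_duplicates() guards against duplicate
# keys in df2 fanning out rows of df1.
df3 = df1.merge(df2[['Campaign', 'Group']].drop_duplicates(),
                on=['Campaign', 'Group'], how='inner')
```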

Jarad

2 Answers


Assuming that your df1 and df2 have exactly the same columns, you can set the join-key columns as the index on both frames, then reindex one against the other's index and use a further .dropna() to produce the intersection.

df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])
# reindex first and dropna will produce the intersection
df3.reindex(df4.index).dropna(how='all').reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
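Run end-to-end on the question's example data, the whole pipeline above can be sketched as:

```python
import pandas as pd

df1 = pd.DataFrame({'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
                    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
                    'Metric': [245, 91, 292, 373, 32]})
df2 = pd.DataFrame({'Campaign': ['Campaign 3', 'Campaign 3'],
                    'Group': ['Group 1', 'Group 2'],
                    'Metric': [23, 456]})

# Move the join keys into the index on both frames
df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])

# reindex aligns df3 to df4's index; any key missing from df3 becomes an
# all-NaN row, which dropna(how='all') then removes
result = df3.reindex(df4.index).dropna(how='all').reset_index()
```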

Edit:

Use .isin when key is not unique.

# create some duplicated keys and values
# (DataFrame.append has since been removed from pandas; pd.concat does the same)
df3 = pd.concat([df3, df3])
df4 = pd.concat([df4, df4])

# isin
df3[df3.index.isin(df4.index)].reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
2  Campaign 3  Group 1     292
3  Campaign 3  Group 2     373
Jianxun Li
  • In example data above, this works great but when I apply this concept to my real datasets, I get `Exception: cannot handle a non-unique multi-index!`. I verified that `df1` and `df2` have the exact same columns. Any ideas of the cause and how to fix? – Jarad Aug 10 '15 at 18:54
  • @Jarad Can you try using the `.isin` to act as the filter rule, for example `df3[df3.index.isin(df4.index)].reset_index()`. – Jianxun Li Aug 10 '15 at 19:07
  • That did the trick. Really appreciate your help Jianxun Li. This concept seems like a great alternative approach to `merge`. – Jarad Aug 10 '15 at 19:23
  • @Jarad You are most welcome. Glad that it helped. :-) – Jianxun Li Aug 10 '15 at 19:24

Alternatively, you can use groupby and filter as follows:

# Compute the set of values you're interested in.
# In your example, this will be {('Campaign 3', 'Group 1'), ('Campaign 3', 'Group 2')}
interesting_groups = set(df2[['Campaign', 'Group']].apply(tuple, axis=1))
# Filter df1, keeping only values in that set
result = df1.groupby(['Campaign', 'Group']).filter(
    lambda x: x.name in interesting_groups
)

See the filter docs for another example.
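A self-contained version of this approach, with the question's example data filled in:

```python
import pandas as pd

df1 = pd.DataFrame({'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
                    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
                    'Metric': [245, 91, 292, 373, 32]})
df2 = pd.DataFrame({'Campaign': ['Campaign 3', 'Campaign 3'],
                    'Group': ['Group 1', 'Group 2'],
                    'Metric': [23, 456]})

# The set of (Campaign, Group) tuples present in df2
interesting_groups = set(df2[['Campaign', 'Group']].apply(tuple, axis=1))

# Each group is passed to filter with its key available as .name
# (a tuple, since we grouped on two columns)
result = df1.groupby(['Campaign', 'Group']).filter(
    lambda x: x.name in interesting_groups
)
```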

Sjlver