
I have a DataFrame with the following form:

      day         u           a
 0 2018-03-01   5658599  suggestion
 1 2018-03-01  10405594  suggestion
 2 2018-03-01   4142545  suggestion
 3 2018-03-01  10397546  suggestion
 4 2018-03-01  10296737  suggestion

And I want to select the pairs of (day, u) that appear in:

     user_id access_date access_platform
 0  10405592  2018-03-01             IOS
 1  10405594  2018-03-01         ANDROID
 2  10405593  2018-03-01         ANDROID
 3  10405596  2018-03-01         ANDROID
 4  10405597  2018-03-01         ANDROID

where u matches user_id and day matches access_date in the same row.

The DataFrames are huge; is there an efficient way of doing this that doesn't require iterating row by row over one of the DataFrames?

An entry of (day, u) can appear multiple times in a DataFrame, and there are other fields which are not shown.

2 Answers


One way is an inner merge:

res = df1.merge(df2, how='inner', left_on=['day', 'u'],
                right_on=['access_date', 'user_id'])\
         .loc[:, df1.columns]

print(res)

          day         u           a
0  2018-03-01  10405594  suggestion
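If you only need to filter df1 and want to avoid materialising the joined columns, a merge-free alternative is to test pair membership with a MultiIndex. A minimal sketch, with the sample frames reconstructed from the question:

```python
import pandas as pd

# Sample frames shaped like the question's data
df1 = pd.DataFrame({'day': ['2018-03-01'] * 5,
                    'u': [5658599, 10405594, 4142545, 10397546, 10296737],
                    'a': ['suggestion'] * 5})
df2 = pd.DataFrame({'user_id': [10405592, 10405594, 10405593, 10405596, 10405597],
                    'access_date': ['2018-03-01'] * 5,
                    'access_platform': ['IOS', 'ANDROID', 'ANDROID', 'ANDROID', 'ANDROID']})

# Build (day, u) pairs and test membership against the (access_date, user_id) pairs
pairs = pd.MultiIndex.from_frame(df1[['day', 'u']])
lookup = pd.MultiIndex.from_frame(df2[['access_date', 'user_id']])
res = df1[pairs.isin(lookup)]

print(res)
#           day         u           a
# 1  2018-03-01  10405594  suggestion
```

Unlike a merge, this keeps every occurrence of a duplicated (day, u) pair in df1 exactly once, so no deduplication step is needed on df1's side.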
jpp

Use merge with drop_duplicates to remove possible duplicates in both DataFrames:

df1 = df1.drop_duplicates(subset=['day','u'])
df2 = df2.drop_duplicates(subset=['access_date','user_id'])

df = df2.merge(df1, left_on=['access_date','user_id'], right_on=['day','u'])[['day','u']]

print(df)
          day         u
0  2018-03-01  10405594
@piRSquared's comment solution:
df = (df1.merge(df2.rename(columns={'user_id': 'u', 'access_date': 'day'}))[['day','u']])

print(df)
          day         u
0  2018-03-01  10405594

If want all columns:

df = df2.merge(df1, left_on=['access_date','user_id'], right_on=['day','u'])[df1.columns]
print(df)

          day         u           a
0  2018-03-01  10405594  suggestion
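To illustrate why the drop_duplicates step matters: if a (day, u) pair appears more than once in either frame, a plain merge repeats the matching rows once per occurrence. A small sketch with made-up duplicate rows:

```python
import pandas as pd

# Hypothetical data: the same (access_date, user_id) pair appears twice in df2
df1 = pd.DataFrame({'day': ['2018-03-01'], 'u': [10405594], 'a': ['suggestion']})
df2 = pd.DataFrame({'user_id': [10405594, 10405594],
                    'access_date': ['2018-03-01', '2018-03-01'],
                    'access_platform': ['ANDROID', 'IOS']})

# Without deduplication the merge repeats the df1 row once per matching df2 row
dup = df2.merge(df1, left_on=['access_date','user_id'], right_on=['day','u'])
print(len(dup))    # 2

# After drop_duplicates each pair matches at most once
dedup = df2.drop_duplicates(subset=['access_date','user_id']) \
           .merge(df1, left_on=['access_date','user_id'], right_on=['day','u'])
print(len(dedup))  # 1
```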
jezrael