Can one perform a left join in pandas that selects only the first match on the right?

Question

Can one perform a left join in pandas that selects only the first match on the right? Example:

left            = pd.DataFrame()
left['age']     = [11, 12]
right           = pd.DataFrame()
right['age']    = [10, 11, 11]
right['salary'] = [ 100, 150, 200 ]
left.merge( right, how='left', on='age' )

Returns

   age  salary
0   11     150
1   11     200
2   12     NaN

But what I would like is to preserve the number of rows of left, by merely taking the first match. That is:

   age  salary
0   11     150
2   12     NaN

So I've been using

left.merge( right.drop_duplicates(['age']), how='left', on='age')

but I believe this makes a full copy of right. And it smells funny.

Is there a more elegant way?

This would straightforward to add as an option to ``merge``, but is not implemented at the moment. — Jeff, Oct 08 '14 at 14:32
Thanks Jeff. I'll request it. Will leave the question here in case somebody can point out to me how to avoid the copy (or side effect) in the interim. — Quant, Oct 08 '14 at 15:39

score 8 · Answer 1 · answered Oct 09 '17 at 18:38

8

Yes, you can use groupby to remove your duplicate lines. Do everything you've done to define left and right. Now, I define a new dataframe on your last line:

left2=left.merge( right, how='left', on='age' )
df= left2.groupby(['age'])['salary'].first().reset_index()
df

At first I used a .min(), which will give you the minimum salary at each age, as such:

df= left2.groupby(['age'])['salary'].min().reset_index()

But you were specifically asking about the first match. To do so you use the .first() option. Note: The .reset_index() at the end, just reformats the output of the groupby to be a dataframe again.

answered Oct 09 '17 at 18:38

samus

147
2
3

2

This will only work under the assumption that the key 'age' is unique in *left*, so it works in this example, but not as a general way to do left merges, which should leave the *left* dataframe complete – 576i Oct 03 '18 at 17:00
6

@576i I agree. In addition `left2.groupby(['age'])['salary'].first().reset_index()` is just a more convoluted way to do `left2.drop_duplicates(['age'])` as it was proposed by the OP. – Jean Paul Jan 27 '20 at 17:25

Can one perform a left join in pandas that selects only the first match on the right?

1 Answers1

Linked