33

Can one perform a left join in pandas that selects only the first match on the right? Example:

left            = pd.DataFrame()
left['age']     = [11, 12]
right           = pd.DataFrame()
right['age']    = [10, 11, 11]
right['salary'] = [ 100, 150, 200 ]
left.merge( right, how='left', on='age' )

Returns

   age  salary
0   11     150
1   11     200
2   12     NaN

But what I would like is to preserve the number of rows of left, by merely taking the first match. That is:

   age  salary
0   11     150
2   12     NaN

So I've been using

left.merge( right.drop_duplicates(['age']), how='left', on='age')

but I believe this makes a full copy of right. And it smells funny.

Is there a more elegant way?

576i
  • 7,579
  • 12
  • 55
  • 92
Quant
  • 4,253
  • 3
  • 17
  • 15
  • 3
    This would straightforward to add as an option to ``merge``, but is not implemented at the moment. – Jeff Oct 08 '14 at 14:32
  • Thanks Jeff. I'll request it. Will leave the question here in case somebody can point out to me how to avoid the copy (or side effect) in the interim. – Quant Oct 08 '14 at 15:39

1 Answers1

8

Yes, you can use groupby to remove your duplicate lines. Do everything you've done to define left and right. Now, I define a new dataframe on your last line:

left2=left.merge( right, how='left', on='age' )
df= left2.groupby(['age'])['salary'].first().reset_index()
df

At first I used a .min(), which will give you the minimum salary at each age, as such:

df= left2.groupby(['age'])['salary'].min().reset_index()

But you were specifically asking about the first match. To do so you use the .first() option. Note: The .reset_index() at the end, just reformats the output of the groupby to be a dataframe again.

samus
  • 147
  • 2
  • 3
  • 2
    This will only work under the assumption that the key 'age' is unique in *left*, so it works in this example, but not as a general way to do left merges, which should leave the *left* dataframe complete – 576i Oct 03 '18 at 17:00
  • 6
    @576i I agree. In addition `left2.groupby(['age'])['salary'].first().reset_index()` is just a more convoluted way to do `left2.drop_duplicates(['age'])` as it was proposed by the OP. – Jean Paul Jan 27 '20 at 17:25