310

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
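
For reference, the example frame above can be built like this (a minimal setup, assuming pandas is imported as pd; the snippets below don't restate it):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 10]})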

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Abe
  • 22,738
  • 26
  • 82
  • 111
  • 1
    Note that the URL in the question appears EOL. – DaveL17 Jan 29 '17 at 00:18
  • For an idiomatic and performant way, [see this solution below](https://stackoverflow.com/a/41650846/3707607). – Ted Petrou Dec 02 '17 at 03:59
  • Time has marched on... As of this writing, I believe [this solution below](https://stackoverflow.com/a/68876659/758174) is faster (at least in the case where there are lots of duplicates) and also simpler. – Pierre D Aug 21 '21 at 20:55

15 Answers

369

This takes the last occurrence rather than the maximum, though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
Zero
  • 74,117
  • 18
  • 147
  • 154
Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
  • 14
    Small note: The `cols` and `take_last` parameters are deprecated and have been replaced by the `subset` and `keep` parameters. http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html – Jezzamon Dec 04 '15 at 04:22
  • as @Jezzamon says, `FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead` – tumultous_rooster Feb 23 '16 at 02:35
  • 2
    Is there a reason not to use `df.sort_values(by=['B']).drop_duplicates(subset=['A'], keep='last')`? I mean this sort_values seems safe to me but I have no idea if it actually is. – Little Bobby Tables Nov 26 '16 at 18:16
  • 7
    This answer is now obsolete. See @Ted Petrou's answer below. – cxrodgers Apr 27 '17 at 22:26
  • If you want to use this code but with more than one column in the `groupby`, you can add `.reset_index(drop=True)`: `df.groupby(['A','C'], group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True)` This will reset the index, as its default value would be a MultiIndex composed from `'A'` and `'C'` – Hamri Said May 31 '17 at 10:22
  • Does this solution work for keep "first" instead of "last"? Thanks – Gonzalo Oct 31 '17 at 17:16
  • This solution does not seem to work for me. I have a column of dates and am trying to remove duplicate dates keeping highest value in other columns. No error but all dates are still there. Any suggestions? – geds133 Jan 04 '19 at 16:09
  • @jezrael For the same question, how do you group by multiple columns and keep the maximum value? – panda Apr 02 '19 at 13:47
  • It worked for me, but if the data looks like this: `data = {'A': [1, 1, 1, 2, 2, 3], 'B': [10, 20, 20, 30, 40, 10]}`, is there any way to remove duplicates but keep the **max** value and its ties as well? Expected: `output = {'A': [1, 1, 2, 3], 'B': [20, 20, 40, 10]}` – Fatima Jan 21 '20 at 19:32
156

The top answer does too much work and looks very slow for larger data sets. `apply` is slow and should be avoided if possible, and `ix` is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
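
One caveat worth noting (an editor's sketch, not part of the original answer): with additional columns, groupby().max() takes each column's maximum independently, so the result can mix values from different rows:

# Hypothetical extra column 'C' to illustrate the caveat
df2 = pd.DataFrame({'A': [1, 1], 'B': [10, 20], 'C': [5, 1]})
df2.groupby('A', as_index=False).max()
#    A   B  C
# 0  1  20  5   <- B comes from the second row, C from the first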

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • 3
    This is actually a clever approach. I was wondering if it can be generalized by using some `lambda` function while dropping. For example, how can I drop only values less than, say, the average of those duplicate values? – Dexter Jul 05 '19 at 09:08
  • This is slower than `groupby` (because of the initial `sort_values()` which is `O[n log n]` and that `groupby` avoids). See [a 2021 answer](https://stackoverflow.com/a/68876659/758174). – Pierre D Aug 21 '21 at 20:59
57

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
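
A note worth adding (editor's sketch, not part of the original answer): keep='last' only keeps the row with the highest value if the frame is sorted first, so you would typically combine it with a sort:

# Assumes 'column_name' holds the keys and 'B' the values to maximize
df = df.sort_values('B').drop_duplicates('column_name', keep='last')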
Gil Baggio
  • 13,019
  • 3
  • 48
  • 37
32

I would sort the dataframe first with column B descending, then drop duplicates for column A, keeping the first:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Nobel
  • 1,485
  • 1
  • 16
  • 19
12

Try this:

df.groupby(['A']).max()
eumiro
  • 207,213
  • 34
  • 299
  • 261
  • 1
    D'you know the best idiom to reindex this to look like the original DataFrame? I was trying to figure that out when you ninja'd me. :^) – DSM Sep 19 '12 at 15:14
  • 5
    Neat. What if the dataframe contains more columns (e.g. C, D, E)? Max doesn't seem to work in that case, because we need to specify that B is the only column that needs to be maximized. – Abe Sep 19 '12 at 15:16
  • 1
    @DSM Check the link in the original question. There's some code to reindex the grouped dataframe. – Abe Sep 19 '12 at 15:20
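
To address the comment about extra columns (an editor's sketch, not part of the original answer): you can keep whole rows, rather than per-column maxima, by selecting the index of each group's maximum of B:

df.loc[df.groupby('A')['B'].idxmax()]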
8

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure the solutions return identical output):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pierre D
  • 24,012
  • 7
  • 60
  • 96
5

Easiest way to do this:

# First, sort the DataFrame by column A ascending and column B descending,
# then drop the duplicate values in column A.
# Optionally, reset the index to get a nice DataFrame again.
# I'm going to show it all in one step.

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42
Peyman Mohamadpour
  • 17,954
  • 24
  • 89
  • 100
rra
  • 809
  • 1
  • 8
  • 20
4

I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A; if you want, you can also get a nice, clean new index like this:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
whateveros
  • 61
  • 4
2

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Merlin
  • 24,552
  • 41
  • 131
  • 206
Venkat
  • 29
  • 1
2

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
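
A dtype-agnostic variant (a sketch, assuming every group is non-empty) is to pick the first mode positionally instead:

df.groupby('columnA').agg({'columnB': lambda x: x.mode().iat[0]}).reset_index()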

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index()

mistaben
  • 21
  • 1
  • 3
0

While the already-given posts answer the question, I made a small change by adding the column name on which the max() function is applied, for better code readability.

df.groupby('A', as_index=False)['B'].max()
Bhagabat Behera
  • 853
  • 7
  • 7
  • Please give a little more context to your answers, explaining how they work and why they are superior or complementary to the answers already available for a question. If they do not provide added value, please refrain from posting additional answers on old questions. Finally, please [format](https://stackoverflow.com/editing-help) your code as a code block by indenting it. – WhoIsJack Jun 24 '18 at 12:03
0

Very similar to the selected answer, but sorting the data frame by multiple columns might be an easier way to code.

First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:

df.sort_values(["A", "B"], ascending=False, inplace=True)

Then, drop the duplicates in "A" and keep only the first item, which is already the one with the highest value:

df.drop_duplicates(subset="A", inplace=True)
kikyo91
  • 37
  • 1
  • 8
0

In case you end up here with a dataframe where duplicate rows share most columns (and only some differ) and you want to keep the original index:

df = (df.sort_values('B', ascending=False)
        .drop_duplicates(list(df.columns.difference(['B'], sort=False)))
        .sort_index())

In the drop_duplicates line you can add the columns that are allowed to differ, so for example:

drop_duplicates(list(df.columns.difference(['B', 'C'], sort=False)))

would mean B and C are not checked for duplicates.

PV8
  • 5,799
  • 7
  • 43
  • 87
-1

this also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index,
                  'B': a.groupby('A')['B'].max().values})
Mahesh
  • 145
  • 8
  • While this code snippet may solve the question, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations! – Martin Tournoij Jan 14 '17 at 17:37
-11

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
Abhranil Das
  • 5,702
  • 6
  • 35
  • 42
  • 9
    Maybe I'm wrong on this, but recasting a pandas DataFrame as a set, then converting it back seems like a very inefficient way to solve this problem. I'm doing log analysis, so I'll be applying this to some very big data sets. – Abe Sep 19 '12 at 15:08
  • Sorry, I don't know too much about this particular scenario, so it may be that my generic answer will not turn out to be too efficient for your problem. – Abhranil Das Sep 19 '12 at 15:12