310

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
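
For reference, the example frame above can be built like this (a minimal setup, assuming pandas is imported as pd; the snippets below don't restate it):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 10]})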

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Abe
  • 22,738
  • 26
  • 82
  • 111
  • 1
    Note that the URL in the question appears EOL. – DaveL17 Jan 29 '17 at 00:18
  • For an idiomatic and performant way, [see this solution below](https://stackoverflow.com/a/41650846/3707607). – Ted Petrou Dec 02 '17 at 03:59
  • Time has marched on... As of this writing, I believe [this solution below](https://stackoverflow.com/a/68876659/758174) is faster (at least in the case where there are lots of duplicates) and also simpler. – Pierre D Aug 21 '21 at 20:55

15 Answers

369

This takes the last occurrence rather than the maximum, though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
Zero
  • 74,117
  • 18
  • 147
  • 154
Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
  • 14
    Small note: The `cols` and `take_last` parameters are deprecated and have been replaced by the `subset` and `keep` parameters. http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html – Jezzamon Dec 04 '15 at 04:22
  • as @Jezzamon says, `FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead` – tumultous_rooster Feb 23 '16 at 02:35
  • 2
    Is there a reason not to use `df.sort_values(by=['B']).drop_duplicates(subset=['A'], keep='last')`? I mean this sort_values seems safe to me but I have no idea if it actually is. – Little Bobby Tables Nov 26 '16 at 18:16
  • 7
    This answer is now obsolete. See @Ted Petrou's answer below. – cxrodgers Apr 27 '17 at 22:26
  • If you want to use this code but with more than one column in the `groupby`, you can add `.reset_index(drop=True)`: `df.groupby(['A','C'], group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True)` This will reset the index, as its default value would be a MultiIndex composed from `'A'` and `'C'` – Hamri Said May 31 '17 at 10:22
  • Does this solution work for keep "first" instead of "last"? Thanks – Gonzalo Oct 31 '17 at 17:16
  • This solution does not seem to work for me. I have a column of dates and am trying to remove duplicate dates keeping highest value in other columns. No error but all dates are still there. Any suggestions? – geds133 Jan 04 '19 at 16:09
  • @jezrael For the same question, how do you group by multiple columns and keep the maximum value? – panda Apr 02 '19 at 13:47
  • It worked for me, but if the data looks like this: `data = {'A': [1, 1, 1, 2, 2, 3], 'B': [10, 20, 20, 30, 40, 10]}`, is there any way to remove duplicates but keep the **max** value and its ties as well? Expected: `output = {'A': [1, 1, 2, 3], 'B': [20, 20, 40, 10]}` – Fatima Jan 21 '20 at 19:32
156

The top answer does too much work and looks very slow for larger data sets. `apply` is slow and should be avoided if possible, and `ix` is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
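
One caveat worth noting (an editor's sketch, not part of the original answer): with additional columns, groupby().max() takes each column's maximum independently, so the result can mix values from different rows:

# Hypothetical extra column 'C' to illustrate the caveat
df2 = pd.DataFrame({'A': [1, 1], 'B': [10, 20], 'C': [5, 1]})
df2.groupby('A', as_index=False).max()
#    A   B  C
# 0  1  20  5   <- B comes from the second row, C from the first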

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • 3
    This is actually a clever approach. I was wondering if it can be generalized by using some `lambda` function while dropping. For example, how can I drop only values less than, say, the average of those duplicate values? – Dexter Jul 05 '19 at 09:08
  • This is slower than `groupby` (because of the initial `sort_values()` which is `O[n log n]` and that `groupby` avoids). See [a 2021 answer](https://stackoverflow.com/a/68876659/758174). – Pierre D Aug 21 '21 at 20:59
57

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
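
A note worth adding (editor's sketch, not part of the original answer): keep='last' only keeps the row with the highest value if the frame is sorted first, so you would typically combine it with a sort:

# Assumes 'column_name' holds the keys and 'B' the values to maximize
df = df.sort_values('B').drop_duplicates('column_name', keep='last')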
Gil Baggio
  • 13,019
  • 3
  • 48
  • 37
32

I would sort the dataframe first with column B descending, then drop duplicates for column A, keeping the first:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Nobel
  • 1,485
  • 1
  • 16
  • 19
12

Try this:

df.groupby(['A']).max()
eumiro
  • 207,213
  • 34
  • 299
  • 261
  • 1
    D'you know the best idiom to reindex this to look like the original DataFrame? I was trying to figure that out when you ninja'd me. :^) – DSM Sep 19 '12 at 15:14
  • 5
    Neat. What if the dataframe contains more columns (e.g. C, D, E)? Max doesn't seem to work in that case, because we need to specify that B is the only column that needs to be maximized. – Abe Sep 19 '12 at 15:16
  • 1
    @DSM Check the link in the original question. There's some code to reindex the grouped dataframe. – Abe Sep 19 '12 at 15:20
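
To address the comment about extra columns (an editor's sketch, not part of the original answer): you can keep whole rows, rather than per-column maxima, by selecting the index of each group's maximum of B:

df.loc[df.groupby('A')['B'].idxmax()]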
8

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure the solutions return identical output):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pierre D
  • 24,012
  • 7
  • 60
  • 96
5

Easiest way to do this:

# First, sort the DataFrame by column A ascending and column B descending,
# then drop the duplicate values in column A.
# Optionally, reset the index to get a nice DataFrame again.
# I'm going to show it all in one step.

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42
Peyman Mohamadpour
  • 17,954
  • 24
  • 89
  • 100
rra
  • 809
  • 1
  • 8
  • 20
4

I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A; if you want, you can also get a nice, clean new index like this:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
whateveros
  • 61
  • 4
2

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Merlin
  • 24,552
  • 41
  • 131
  • 206
Venkat
  • 29
  • 1
2

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
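
A dtype-agnostic variant (a sketch, assuming every group is non-empty) is to pick the first mode positionally instead:

df.groupby('columnA').agg({'columnB': lambda x: x.mode().iat[0]}).reset_index()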

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index()

mistaben
  • 21
  • 1
  • 3
0

While the already-given posts answer the question, I made a small change by adding the column name on which the max() function is applied, for better code readability.

df.groupby('A', as_index=False)['B'].max()
Bhagabat Behera
  • 853
  • 7
  • 7
  • Please give a little more context to your answers, explaining how they work and why they are superior or complementary to the answers already available for a question. If they do not provide added value, please refrain from posting additional answers on old questions. Finally, please [format](https://stackoverflow.com/editing-help) your code as a code block by indenting it. – WhoIsJack Jun 24 '18 at 12:03
0

Very similar to the selected answer, but sorting the data frame by multiple columns might be an easier way to code.

First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:

df.sort_values(["A", "B"], ascending=False, inplace=True)

Then, drop the duplicates in "A" and keep only the first item, which is already the one with the highest value:

df.drop_duplicates(subset="A", inplace=True)
kikyo91
  • 37
  • 1
  • 8
0

In case you end up here with a dataframe where duplicate rows share most columns (and only some differ) and you want to keep the original index:

df = (df.sort_values('B', ascending=False)
        .drop_duplicates(list(df.columns.difference(['B'], sort=False)))
        .sort_index())

In the drop_duplicates line you can add the columns that are allowed to differ, so for example:

drop_duplicates(list(df.columns.difference(['B', 'C'], sort=False)))

would mean B and C are not checked for duplicates.

PV8
  • 5,799
  • 7
  • 43
  • 87
-1

this also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index,
                  'B': a.groupby('A')['B'].max().values})
Mahesh
  • 145
  • 8
  • While this code snippet may solve the question, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations! – Martin Tournoij Jan 14 '17 at 17:37
-11

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
Abhranil Das
  • 5,702
  • 6
  • 35
  • 42
  • 9
    Maybe I'm wrong on this, but recasting a pandas DataFrame as a set, then converting it back seems like a very inefficient way to solve this problem. I'm doing log analysis, so I'll be applying this to some very big data sets. – Abe Sep 19 '12 at 15:08
  • Sorry, I don't know too much about this particular scenario, so it may be that my generic answer will not turn out to be too efficient for your problem. – Abhranil Das Sep 19 '12 at 15:12