I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C'
takes its max, for each group of ['A', 'B']
, we would do:
out = df.loc[df.groupby(['A', 'B')['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates()
solution (less sorting):
Setup:
n = 1_000_000
df = pd.DataFrame({
'A': np.random.randint(0, 20, n),
'B': np.random.randint(0, 20, n),
'C': np.random.uniform(size=n),
'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index()
to ensure equal solution):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)