For the same value in column a, sort values in column b and choose top 5

Question

I have a dataframe that has two columns col_1 and col_2. Values in column col_2 corresponds to values in column col_1.

print (df)
   col_1  col_2
1      a     12
2      a     33
3      a     11
4      a      4
5      a     42
6      a     66
7      a      9
8      b     12
9      b     34
10     b     42
11     b     64
12     b     86
13     b      2

What i am trying to do is that for each value in col_1 (a, b, c,..), i want to sort the corresponding values in col_2 and select ONLY the top 5 values. The new dataframe is expected to be like this:

I tried the of dropping duplicates applied here since col_2 can sometimes have duplicates. But, it didn't work.

df.sort_values('col_2', ascending=False).drop_duplicates('col_a').sort_index()

Any suggestions will be appreciated

jezrael · Accepted Answer · 2019-09-10T10:52:57.193

4

Use instead drop_duplicates function GroupBy.head and also add first column to DataFrame.sort_values with ascending sorting:

n = 5
df1 = df.sort_values(['col_1','col_2'], ascending=[True, False]).groupby('col_1').head(n)
print (df1)
   col_1  col_2
6      a     66
5      a     42
2      a     33
1      a     12
3      a     11
12     b     86
11     b     64
10     b     42
9      b     34
8      b     12

edited Sep 10 '19 at 10:52

answered Sep 10 '19 at 09:26

jezrael

822,522
95
1,334
1,252

Perhaps you don't need to assign a variable, can be `df.sort_values('col_2', ascending=False).groupby('col_1').head(5)`, however it doesn't provide the exact expected data. – Karn Kumar Sep 10 '19 at 09:52

score 2 · Answer 2 · answered Sep 10 '19 at 09:38

Using GroupBy.Series.nlargest

df.groupby('col_1')['col_2'].nlargest(5)

Output

col_1    
a      5     66
       4     42
       1     33
       0     12
       2     11
b      11    86
       10    64
       9     42
       8     34
       7     12

To get the correct index, use reset_index:

df.groupby('col_1')['col_2'].nlargest(5).reset_index(level=0).reset_index(drop=True)

Output

  col_1  col_2
0     a     66
1     a     42
2     a     33
3     a     12
4     a     11
5     b     86
6     b     64
7     b     42
8     b     34
9     b     12

This results the expected data as OP desired. – Karn Kumar Sep 10 '19 at 09:53 — Karn Kumar, Sep 10 '19 at 09:53

For the same value in column a, sort values in column b and choose top 5

2 Answers2