Using groupby in Pandas to get the top 3 rows by column value

Question

I have this dataframe:

    person_code  type   growth   size  ...
0 .         231    32     0.54     32
1 .         233    43     0.12    333
2 .         432    32     0.44     21
3 .         431    56     0.32     23
4 .         654    89     0.12     89
5 .         764    32     0.20    211
6 .         434    32     0.82     90
...

(This dataframe is pretty big, I made a simplification here)

I want to create one dataframe for each type with the 3 persons with higher "growth", ordered by it. I want to be able to call it by type. In this case, let's use the type 32, so the output df should look something like this:

    person_code  type   growth   size  ...
6 .         434    32     0.82     90
0 .         231    32     0.54     32
2 .         432    32     0.44     21
...

I understand that it would be something using groupby:

groups=dataframe.groupby('type')

But how could I call the groupby object with the rows where type is 32? And what would be the best what to separate only the top 3 by growth?

score 2 · Accepted Answer · edited Mar 05 '18 at 00:16

2

IIUC, you don't need a groupby, just query to filter the dataframe then nlargest:

df.query('type == 32').nlargest(3, 'growth')

And, to parameterize 'type' input, you can use this syntax:

in_type = 32

df.query('type == @in_type').nlargest(3, 'growth')

Output:

     person_code  type  growth  size
6 .          434    32    0.82    90
0 .          231    32    0.54    32
2 .          432    32    0.44    21

Or if you want to use groupby, you can use query to get only the types you need.

type_group_df = df.groupby('type', group_keys=False)\
                  .apply(pd.DataFrame.nlargest,n=3,columns='growth')

To call it, you can use:

type_group_df.query('type == 32')

If you've got a string as type it would look like this:

type_group_df.query('type == "brazilian"')

However, if by any chance your column name start with special characters, such as '#', you should use this:

type_group_df[type_group_df['#type'] == 32]

Output:

     person_code  type  growth  size
6 .          434    32    0.82    90
0 .          231    32    0.54    32
2 .          432    32    0.44    21

Query another type (43):

type_group_df.query('type == 43')

Output:

     person_code  type  growth  size
1 .          233    43    0.12   333

edited Mar 05 '18 at 00:16

aabujamra

4,494
13
51
101

answered Mar 04 '18 at 23:19

Scott Boston

147,308
15
139
187

Does that work if type is actually a string and not a number? Like "brazilian" for example. – aabujamra Mar 04 '18 at 23:53
Yes you have to use double quotes("brazilian") inside of single quotes, like this, `type_group_df.query('type == "brazilian"')` – Scott Boston Mar 04 '18 at 23:55
I did it and I'm getting ValueError: expr must be a string to be evaluated, given :/ Does that work for for python 3.5 too? – aabujamra Mar 04 '18 at 23:59
Yes, what are the exactly lines you are trying? – Scott Boston Mar 05 '18 at 00:00
Oh, I was making a mistake in the synthax, I was like cnae_players.query('#CNAE' == '"64.34-4-00"'). Then I fixed to cnae_players.query('#CNAE == "64.34-4-00"')) and got another error: SyntaxError: only a single expression is allowed – aabujamra Mar 05 '18 at 00:04
1

Oh... query will not work with those column names starting with a special character or containing a space. You need to revert back to boolean indexing such as df[df['#CNAE'] == "64.34-4-00"] – Scott Boston Mar 05 '18 at 00:06

score 2 · Answer 2 · answered Mar 04 '18 at 23:20

You can do this for all of the types at the same time:

df.groupby('type').apply(lambda dft: dft.nlargest(3, 'growth'))

returns

        person_code  type  growth  size
type                                   
32   6          434    32    0.82    90
     0          231    32    0.54    32
     2          432    32    0.44    21
43   1          233    43    0.12   333
56   3          431    56    0.32    23
89   4          654    89    0.12    89

score 1 · Answer 3 · answered Mar 04 '18 at 23:20

Something like ?

df.sort_values(['type','person_code']).groupby('type').head(3)
Out[184]: 
   person_code  type  growth  size
0          231    32    0.54    32
2          432    32    0.44    21
6          434    32    0.82    90
1          233    43    0.12   333
3          431    56    0.32    23
4          654    89    0.12    89

Tai · Answer 4 · 2018-03-05T00:13:10.520

Find the indices of the top 3 growth values for each group and feed the level-1 indices into .loc.

idx = df.groupby("type")["growth"].nlargest(3).index

# MultiIndex(levels=[[32, 43, 56, 89], [0, 1, 2, 3, 4, 6]],
#           labels=[[0, 0, 0, 1, 2, 3], [5, 0, 2, 1, 3, 4]],
#           names=['type', None])

dftop3 = df.loc[idx.get_level_values(1)]

    person_code type    growth  size
6   434         32       0.82   90
0   231         32       0.54   32
2   432         32       0.44   21
1   233         43       0.12   333
3   431         56       0.32   23
4   654         89       0.12   89

dftop3[dftop3.type == 32]

person_code type    growth  size
6   434      32     0.82    90
0   231      32     0.54    32
2   432      32     0.44    21

How could I call it to get only the ones from type 32? – aabujamra Mar 04 '18 at 23:55 — aabujamra, Mar 04 '18 at 23:55

Using groupby in Pandas to get the top 3 rows by column value

4 Answers4

Linked