
I have a dataframe of taxi data with three columns that looks like this:

Neighborhood    Borough        Time
Midtown         Manhattan      X
Melrose         Bronx          Y
Grant City      Staten Island  Z
Midtown         Manhattan      A
Lincoln Square  Manhattan      B

Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the most number of pickups. I tried this:

df['Neighborhood'].groupby(df['Borough']).value_counts()

Which gives me something like this:

Borough
Bronx          High  Bridge          3424
               Mott Haven            2515
               Concourse Village     1443
               Port Morris           1153
               Melrose                492
               North Riverdale        463
               Eastchester            434
               Concourse              395
               Fordham                252
               Wakefield              214
               Kingsbridge            212
               Mount Hope             200
               Parkchester            191
......

Staten Island  Castleton Corners        4
               Dongan Hills             4
               Eltingville              4
               Graniteville             4
               Great Kills              4
               Castleton                3
               Woodrow                  1

How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren't helpful to my case.

ytk

6 Answers


I think you can use nlargest - you can change 1 to 5:

s = df['Neighborhood'].groupby(df['Borough']).value_counts()
print(s)
Borough                      
Bronx          Melrose            7
Manhattan      Midtown           12
               Lincoln Square     2
Staten Island  Grant City        11
dtype: int64

print(s.groupby(level=0).nlargest(1))
Bronx          Bronx          Melrose        7
Manhattan      Manhattan      Midtown       12
Staten Island  Staten Island  Grant City    11
dtype: int64

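As a runnable sketch of this approach (the data below is a toy stand-in, not the asker's real counts), `nlargest` applied per borough looks like:

```python
import pandas as pd

# Toy data mirroring the question's sample rows.
df = pd.DataFrame({
    'Borough': ['Manhattan', 'Bronx', 'Staten Island', 'Manhattan', 'Manhattan'],
    'Neighborhood': ['Midtown', 'Melrose', 'Grant City', 'Midtown', 'Lincoln Square'],
})

# Count pickups per (Borough, Neighborhood).
s = df.groupby('Borough')['Neighborhood'].value_counts()

# Top 5 neighborhoods within each borough; group_keys=False keeps the
# Borough level from being duplicated in the result's index.
top5 = s.groupby(level=0, group_keys=False).nlargest(5)
print(top5)
```

With `group_keys=False` the result keeps the original two-level `(Borough, Neighborhood)` index instead of the tripled index shown above.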

Itachi
jezrael

You can do this in a single line by slightly extending your original groupby with 'nlargest':

>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5)
Borough        Neighborhood    Neighborhood  
Bronx          Melrose         Melrose           1
Manhattan      Midtown         Midtown           1
Manhatten      Lincoln Square  Lincoln Square    1
               Midtown         Midtown           1
Staten Island  Grant City      Grant City        1
dtype: int64
Alexander

Solution for getting the top n from every group:

df.groupby(['Borough']).Neighborhood.value_counts().groupby(level=0, group_keys=False).head(5)
  1. .value_counts().nlargest(5) in other answers only gives you the top 5 overall, not the top 5 per group, which didn't make sense to me either.
  2. group_keys=False avoids a duplicated index.
  3. Because value_counts() already sorts within each group, head(5) is enough.
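A minimal self-contained sketch of this head-based approach (the data below is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Borough': ['Bronx'] * 4 + ['Manhattan'] * 4,
    'Neighborhood': ['Melrose', 'Melrose', 'Mott Haven', 'Concourse',
                     'Midtown', 'Midtown', 'Midtown', 'Lincoln Square'],
})

# value_counts() sorts descending within each borough, so a per-group
# head(2) takes each borough's two most frequent neighborhoods.
top2 = (df.groupby('Borough')['Neighborhood']
          .value_counts()
          .groupby(level=0, group_keys=False)
          .head(2))
print(top2)
```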
Mithril
df['Neighborhood'].groupby(df['Borough']).value_counts().head(5)

head(5) returns the first 5 rows of the result.

sushanth

Try this one (just change the number in head() to your choice):

# top 3 : total counts of 'Neighborhood' in each Borough
Z = df.groupby('Borough')['Neighborhood'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()

Z
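For illustration, here is that one-liner run against a small invented DataFrame; the final to_frame('counts').reset_index() turns the counted Series into a tidy frame with Borough, Neighborhood, and counts columns:

```python
import pandas as pd

# Invented sample data, just to show the shape of Z.
df = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx', 'Manhattan', 'Manhattan', 'Manhattan'],
    'Neighborhood': ['Melrose', 'Melrose', 'Midtown', 'Midtown', 'Lincoln Square'],
})

# top 3 neighborhoods per borough, flattened into a regular DataFrame
Z = (df.groupby('Borough')['Neighborhood']
       .value_counts()
       .groupby(level=0)
       .head(3)
       .sort_values(ascending=False)
       .to_frame('counts')
       .reset_index())
print(Z)
```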
Manoj Kumar

You can also try the code below to get only the top 10 totals.

'country_code' and 'raised_amount_usd' are column names.

groupby_country_code = master_frame.groupby('country_code')
arr = groupby_country_code['raised_amount_usd'].sum().sort_values(ascending=False)[0:10]
print(arr)

[0:10] slices the first 10 rows of the sorted result; you can choose your own slicing option.
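A runnable sketch of this variant, with an invented stand-in for master_frame (only the column names come from the answer); sorting the totals descending before slicing is what makes [0:10] the top 10:

```python
import pandas as pd

# Hypothetical stand-in for master_frame; the values are made up.
master_frame = pd.DataFrame({
    'country_code': ['USA', 'IND', 'GBR', 'USA', 'IND', 'USA'],
    'raised_amount_usd': [100, 50, 80, 200, 31, 50],
})

# Sum per country, sort descending, keep the 10 largest totals.
groupby_country_code = master_frame.groupby('country_code')
arr = groupby_country_code['raised_amount_usd'].sum().sort_values(ascending=False)[0:10]
print(arr)
```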

ParikshitSinghTomar