90

This would be useful so I know how many unique groups I have to perform calculations on. Thank you.

Suppose the groupby object is called dfgroup.

cs95
wolfsatthedoor

2 Answers

117

Simple, Fast, and Pandaic: ngroups

Newer versions of the groupby API (pandas >= 0.23) provide this (undocumented) attribute, which stores the number of groups in a GroupBy object.

# setup
import pandas as pd

df = pd.DataFrame({'A': list('aabbcccd')})
dfg = df.groupby('A')
# call `.ngroups` on the GroupBy object
dfg.ngroups
# 4

Note that this is different from GroupBy.groups which returns the actual groups themselves.

Why should I prefer this over len?

As noted in BrenBarn's answer, you could use len(dfg) to get the number of groups. But you shouldn't. Looking at the implementation of GroupBy.__len__ (which is what len() calls internally), we see that __len__ makes a call to GroupBy.groups, which returns a dictionary of grouped indices:

dfg.groups
{'a': Int64Index([0, 1], dtype='int64'),
 'b': Int64Index([2, 3], dtype='int64'),
 'c': Int64Index([4, 5, 6], dtype='int64'),
 'd': Int64Index([7], dtype='int64')}

Depending on the number of groups in your operation, generating the dictionary only to find its length is a wasteful step. ngroups, on the other hand, is a stored property that can be accessed in constant time.

Using len this way is documented under "GroupBy object attributes". The issue with len, however, is that for a GroupBy object with a lot of groups, it can take a lot longer than ngroups.
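If you want to check the difference on your own machine, here is a rough sketch. The data and the names `df_big` and `time_once` are made up for illustration, and the timings will vary with your pandas version and hardware (newer releases may have narrowed the gap), so treat the numbers as indicative only. A fresh GroupBy object is built for each measurement so that nothing cached from an earlier call is reused.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 200,000 rows with up to 50,000 distinct integer keys.
rng = np.random.default_rng(0)
df_big = pd.DataFrame({"key": rng.integers(0, 50_000, size=200_000)})


def time_once(func):
    """Time a single call of func, in seconds."""
    return timeit.timeit(func, number=1)


# Each lambda creates its own GroupBy object, so no cached state is shared.
t_len = time_once(lambda: len(df_big.groupby("key")))
t_ngroups = time_once(lambda: df_big.groupby("key").ngroups)

# Both approaches should report the same number of groups.
n_len = len(df_big.groupby("key"))
n_ngroups = df_big.groupby("key").ngroups

print(f"len(dfg):    {n_len} groups, {t_len:.4f} s")
print(f"dfg.ngroups: {n_ngroups} groups, {t_ngroups:.4f} s")
```

Both calls should agree on the count; only the time taken to get there should differ.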

But what if I actually want the size of each group?

You're in luck. We have a function for that: GroupBy.size. Note that size counts NaNs as well. If you don't want NaNs counted, use GroupBy.count instead.
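A small sketch of the difference, using a made-up frame (`df_nan` and its columns are just for illustration): size counts every row in each group, while count only counts the non-NaN values in the selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B has missing values in both groups.
df_nan = pd.DataFrame({
    "A": ["a", "a", "b", "b", "b"],
    "B": [1.0, np.nan, 2.0, np.nan, np.nan],
})
g = df_nan.groupby("A")

# size() counts rows per group, NaNs included: a -> 2, b -> 3
sizes = g.size()

# count() counts non-NaN values of B per group: a -> 1, b -> 1
counts = g["B"].count()

print(sizes)
print(counts)
```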

cs95
  • @U9-Forward Thanks! It isn't a popular question (relatively speaking) but I assume the upvotes here mean the answer is useful. I still feel like I can make improvements so I'll look into that in a bit. – cs95 May 17 '19 at 05:28
  • 1
    You deserve a little more i guess, `ngroups` is clever :-) – U13-Forward May 17 '19 at 05:29
  • 3
    Note `len(g)` can be *VERY* slow the first time it is called if there are a large number of groups!! IPython caches the result thereafter, but `g.ngroups` is always fast since it is stored as an attribute. – Bernie Roesler Aug 22 '19 at 20:19
  • I guess a potential downside to `ngroups` is that being undocumented, it may break without notice in future pandas versions? – James Hirschorn Feb 05 '23 at 07:22
66

As documented, you can get the number of groups with len(dfgroup).
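For example, a minimal sketch with a made-up frame grouped on two keys (the column names are only for illustration). Here len counts the key combinations that actually occur in the data:

```python
import pandas as pd

# Hypothetical data: four distinct (city, year) combinations across five rows.
df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "LA"],
    "year": [2020, 2021, 2020, 2020, 2021],
    "sales": [1, 2, 3, 4, 5],
})
dfgroup = df.groupby(["city", "year"])

# Number of groups: (LA, 2020), (LA, 2021), (NY, 2020), (NY, 2021)
n_groups = len(dfgroup)
print(n_groups)  # 4
```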

BrenBarn
  • 16
    As noted below, using `len(dfgroup)` can be very slow, especially for large number of groups. `dfgroup.ngroups` is the fastest way to get this, as this is a stored value! – Shuchita Banthia Nov 09 '19 at 11:12