Groupby value counts on the dataframe pandas

Question

I have the following dataframe:

import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

I want to group it by id and group and count the occurrences of each term for every (id, group) pair.

So in the end I want to get something like this:

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them).

It looks like I have to group by and then count values, so I tried df.groupby(['id', 'group']).value_counts(), which does not work because value_counts is defined on a grouped series and not on a grouped dataframe.

Is there any way I can achieve this without looping?

score 171 · Accepted Answer · edited Jun 20 '20 at 09:12

I use groupby and size

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
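To see why this works, it can help to look at the intermediate result. A minimal sketch, assuming the question's seven-row df (the names counts and wide are only for illustration): size() gives one count per (id, group, term) combination that actually occurs, and unstack moves the innermost index level, term, into columns, filling combinations that never occur with 0.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# Step 1: one entry per observed (id, group, term) combination,
# holding the number of rows in that combination.
counts = df.groupby(['id', 'group', 'term']).size()
print(counts)

# Step 2: pivot the innermost index level ('term') into columns;
# fill_value=0 replaces the missing combinations with 0 instead of NaN.
wide = counts.unstack(fill_value=0)
print(wide)
```

Because fill_value=0 is passed to unstack, the result keeps an integer dtype, which is why no fillna(0).astype(int) step is needed.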

Timing

1,000,000 rows

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))
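The measured results are not shown here. As a rough way to run the comparison yourself, here is a sketch using the standard timeit module on a smaller frame (100,000 rows, an arbitrary size picked to keep the run short). The helper names via_size and via_crosstab and the seed are my own choices, not part of the original answer. Absolute numbers depend on the machine and the pandas version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 100_000                     # smaller than the 1,000,000 rows used above
df = pd.DataFrame(dict(id=rng.integers(0, 100, n),
                       group=rng.integers(0, 20, n),
                       term=rng.integers(0, 10, n)))


def via_size():
    # groupby + size + unstack, as in the answer
    return df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)


def via_crosstab():
    # crosstab, for comparison
    return pd.crosstab([df.id, df.group], df.term)


for fn in (via_size, via_crosstab):
    # best of 3 single runs, reported in milliseconds
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```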

answered Aug 24 '16 at 20:57

piRSquared

@jezrael thx, `size` is quicker too. `crosstab` is oddly inefficient – piRSquared Aug 24 '16 at 21:02
And I am surprised that `crosstab` is so lazy ;) – jezrael Aug 24 '16 at 21:09
@jezrael, `crosstab` uses `pivot_table` internally... ;) – MaxU - stand with Ukraine Aug 24 '16 at 21:12
@piRSquared - can you add to timings `df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)` ? It seems faster for me. Thanks. – jezrael Aug 24 '16 at 22:03
@piRSquared - I tried it on a larger df and it is a bit faster (0.2 ms, so maybe it is the same ;)) – jezrael Aug 24 '16 at 22:08
@jezrael all are same except for `crosstab` – piRSquared Aug 24 '16 at 22:14
@piRSquared How would you create the same table except with proportions of respective classes? – user2205916 Jan 20 '21 at 17:50
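
For the proportions asked about in the last comment, one possible approach (a sketch, not taken from the original answer) is to divide each row of the count table by its row total. The names counts and proportions are illustrative. pd.crosstab with normalize='index' should give the same numbers.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

counts = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

# Divide every row by its own total so that each (id, group) row sums to 1.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions)

# Alternative: let crosstab normalise over each row.
via_crosstab = pd.crosstab([df.id, df.group], df.term, normalize='index')
print(via_crosstab)
```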

MaxU - stand with Ukraine · Answer 2 · 2016-08-25T07:17:10.660

29

Using the pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Timing against 700K rows DF:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)

In [25]: df.shape
Out[25]: (700000, 3)

In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop

In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop

In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop

In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop

In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop

Timing against 7M rows DF:

In [9]: df = pd.concat([df] * 10, ignore_index=True)

In [10]: df.shape
Out[10]: (7000000, 3)

In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop

In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop

In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop

In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop

In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop

edited Aug 25 '16 at 07:17

answered Aug 24 '16 at 20:53

MaxU - stand with Ukraine

I was just trying to update timings with larger sample :-) – piRSquared Aug 24 '16 at 21:07
wow! pivot seems just as efficient at larger scales. I'll have to remember that. I'd give you +1 but I already did a while ago. – piRSquared Aug 24 '16 at 21:08
So `size` was the alias that we forgot [here](http://stackoverflow.com/a/38279370/2285236) :) – ayhan Aug 24 '16 at 21:10
@ayhan, very strange - this time the solution with `df.assign(ones = np.ones(len(df))).pivot_table(index=['id','group'], columns='term', values = 'ones', aggfunc=np.sum, fill_value=0)` is a bit slower - `1 loop, best of 3: 2.55 s per loop` – MaxU - stand with Ukraine Aug 24 '16 at 21:16
I think it is because you used `len` there, instead of 'size'. `len` is a Python function but the functions we pass as strings are aliases to optimized C functions. – ayhan Aug 24 '16 at 21:18
@ayhan, i have the same timing if i use `df.shape[0]` instead of `len(df)` – MaxU - stand with Ukraine Aug 24 '16 at 21:21
@MaxU - can you add to timings `df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)` ? Thanks. – jezrael Aug 24 '16 at 22:03
@jezrael, i've added a timing for this solution – MaxU - stand with Ukraine Aug 25 '16 at 07:17
@MaxU - very interesting, I get the same timings with the fastest method, but you get different ones. Thank you for the timings. – jezrael Aug 25 '16 at 07:19

score 26 · Answer 3 · answered Aug 24 '16 at 21:46

Instead of remembering lengthy solutions, how about the one that pandas has built in for you:

df.groupby(['id', 'group', 'term']).count()

A.Kot

Maybe this used to work before, but it doesn't return any columns in pandas 1.5.2 – ali bakhtiari Jan 05 '23 at 13:39
@alibakhtiari, would love to see what columns your dataframe has, groupby count has been working since python existed and still does. – A.Kot Mar 17 '23 at 15:35
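
The disagreement in these comments can be checked directly. In this sketch, which assumes the question's df, every column is used as a grouping key, so count() has no remaining column to count and, at least in recent pandas versions, returns a frame with zero columns. size() counts rows per group and needs no value column. The names by_count and by_size are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# count() counts non-null values in the columns that are NOT grouping keys.
# Here all three columns are keys, so nothing is left to count.
by_count = df.groupby(['id', 'group', 'term']).count()
print(by_count.shape)

# size() counts the rows in each group, independent of any value column.
by_size = df.groupby(['id', 'group', 'term']).size()
print(by_size.unstack(fill_value=0))
```

count() does give per-group counts when the frame has at least one column that is not a grouping key, which may explain why it works on other dataframes.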

jezrael · Answer 4 · 2016-08-24T22:02:25.997

17

You can use crosstab:

print (pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Another solution uses groupby, aggregating with size and reshaping with unstack:

df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
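
As a quick check that the two solutions agree, the sketch below builds both tables from the question's df and compares their values. crosstab takes the row keys as its first argument and the column key as its second. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# First argument: row keys (id, group); second argument: column key (term).
via_crosstab = pd.crosstab([df.id, df.group], df.term)

# Same table built with groupby + size + unstack.
via_groupby = (df.groupby(['id', 'group', 'term'])['term']
                 .size()
                 .unstack(fill_value=0))

# Compare the underlying values cell by cell.
same_values = (via_crosstab.to_numpy() == via_groupby.to_numpy()).all()
print(same_values)
```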

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)

In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop

In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop

edited Aug 24 '16 at 22:02

answered Aug 24 '16 at 20:47

jezrael

wow wow wow, you are amazing. And it took you only 3 minutes (the same time it took me to write a loop, and less time than it took me to write this question). I would really appreciate it if you could write some explanation of why this works, but most probably I will be able to understand it by myself in a few minutes. – Salvador Dali Aug 24 '16 at 20:53
In your case `crosstab` is better than `pivot_table`, because its default aggregating function is `len` (it is the same as `size`), and I think it is also the faster solution. `crosstab` uses the first argument as the `index` and the second as the columns. Give me some time, I will try to add timings. – jezrael Aug 24 '16 at 20:57
But I think it is explained better in the [`docs`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#cross-tabulations). – jezrael Aug 24 '16 at 20:58

score 6 · Answer 5 · answered Oct 14 '21 at 15:24

If you want to use value_counts you can use it on a given series, and resort to the following:

df.groupby(["id", "group"])["term"].value_counts().unstack(fill_value=0)

or in an equivalent fashion, using the .agg method:

df.groupby(["id", "group"]).agg({"term": "value_counts"}).unstack(fill_value=0)

Another option is to directly use value_counts on the DataFrame itself without resorting to groupby:

df.value_counts().unstack(fill_value=0)
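
A small sketch comparing two of the value_counts variants with the size-based table on the question's df. Note that DataFrame.value_counts needs pandas 1.1 or later and orders its result by count rather than by key, so the sketch puts rows and columns in a canonical order before comparing. The helper canonical and the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])


def canonical(table):
    # Sort rows and columns so tables can be compared cell by cell.
    return table.sort_index().sort_index(axis=1)


by_size = canonical(
    df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))

# value_counts on the grouped 'term' series
by_series_vc = canonical(
    df.groupby(['id', 'group'])['term'].value_counts().unstack(fill_value=0))

# value_counts on the whole frame: counts unique (id, group, term) rows
by_frame_vc = canonical(df.value_counts().unstack(fill_value=0))

print((by_size.to_numpy() == by_series_vc.to_numpy()).all())
print((by_size.to_numpy() == by_frame_vc.to_numpy()).all())
```

The frame-level variant counts unique combinations of all columns, so it only matches when the frame contains exactly the key columns, as it does here.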

score 0 · Answer 6 · answered Jan 05 '23 at 13:21

Another alternative:

df.assign(count=1).groupby(['id', 'group', 'term']).sum().unstack(fill_value=0).xs("count", axis=1)

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
