
I have a dataframe (toy example below, borrowed from another post) which you can generate with the code further down. I'd like to group by columns 'col1' and 'col2' and count the number of occurrences within each group, as in this example: How to count number of rows per group (and other statistics) in pandas group by?

But I'd like to include the result directly in my dataframe, as in this example (where there is only one column to group on): Pandas, group by count and add count to original dataframe?

I have tried:

df['count'] = df.groupby(['col1','col2']).transform('count')

And:

df['count'] = df.groupby(['col1','col2'])[['col1','col2']].transform('count')

But I get the same error both times:

ValueError: Length of passed values is 10, index implies 0 

Any idea how I could get around this without having to merge the result back into my initial dataframe? In R's dplyr this would be quite easy with group_by, mutate and n().

Toy example:

  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

Code to generate the toy dataframe:

import numpy as np
import pandas as pd 

keys = np.array([
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['C', 'D'],
    ['C', 'D'],
    ['C', 'D'],
    ['E', 'F'],
    ['E', 'F'],
    ['G', 'H']
])

df = pd.DataFrame(
    np.hstack([keys, np.random.randn(10, 4).round(2)]),
    columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
)

df[['col3', 'col4', 'col5', 'col6']] = df[['col3', 'col4', 'col5', 'col6']].astype(float)
Vincent
  • What's the desired output? Your question is unclear to me, since by aggregating you end up with a reduced DataFrame, and therefore can't merge it with the previous one. Depending on the desired output, that might be feasible though. – IMCoins Jan 29 '19 at 19:17
  • The issue with your first method is that `transform('count')` returns a `DataFrame`, which you are trying to assign to a single column, so that won't work. In the second case that remains unchanged; you just have a 2-column `DataFrame` instead of 4. – ALollz Jan 29 '19 at 19:51
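
A minimal sketch of the shape issue this comment describes, using the toy frame generated above (the variable names are only for illustration):

grouped = df.groupby(['col1', 'col2'])

# transform('count') keeps the original row index but returns one column
# per non-grouping column, i.e. a (10, 4) DataFrame here
counts_frame = grouped.transform('count')
print(counts_frame.shape)          # (10, 4)

# selecting a single column first yields a Series, which can be assigned
df['count'] = grouped['col3'].transform('count')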

2 Answers


EDIT

After @ALollz's comment, the code can be reduced to a single line, without the concatenation: df['count'] = df.groupby(['col1', 'col2'])['col1'].transform('size')
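
A quick check of that one-liner against the toy frame above (the random float columns differ per run, but the group sizes are fixed by the keys):

df['count'] = df.groupby(['col1', 'col2'])['col1'].transform('size')
print(df[['col1', 'col2', 'count']])
#   col1 col2  count
# 0    A    B      4
# 1    A    B      4
# 2    A    B      4
# 3    A    B      4
# 4    C    D      3
# 5    C    D      3
# 6    C    D      3
# 7    E    F      2
# 8    E    F      2
# 9    G    H      1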


Answer before the edit

The workaround would be to make a concatenated column before applying the groupby.

After that you can use the size function inside your transform method.

Maybe not the most elegant, but it works.

Code

# make a concatenated column 
df['concat'] = df.col1 + df.col2

# perform the transformation asked
df['count'] = df.groupby(['concat'])['concat'].transform('size')

# drop the concat column
df.drop('concat', axis=1, inplace=True)


  col1 col2  col3  col4  col5  col6  count
0    A    B -0.62  0.09  0.92  1.45      4
1    A    B  1.49  0.31 -0.40 -0.99      4
2    A    B -1.30  1.01 -0.31 -0.32      4
3    A    B -0.05  0.32 -1.45 -0.54      4
4    C    D -1.30  1.26  1.80  0.50      3
5    C    D  0.75  0.13  0.49 -2.37      3
6    C    D  1.05 -0.96 -0.44 -1.00      3
7    E    F  0.31 -0.93 -1.78 -1.49      2
8    E    F -0.23  0.30  0.77  1.46      2
9    G    H -0.67  0.88 -0.26 -1.09      1

Hope this helps.

Erfan
  • The `size` of everything within a group is the same, since it includes `NaN` and DataFrames must be rectangular. (If you do `df.groupby(['col1','col2']).agg('size')`, you'll notice it only returns a single Series, unlike the DataFrame it returns for things like `.agg('count')`.) So just pick any column you want to transform: `df.groupby(['col1','col2']).col1.transform('size')` – ALollz Jan 29 '19 at 20:02
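
A small illustration of the size/count distinction from this comment, reusing df and np from the setup above; a NaN is injected only to make the difference visible:

df_nan = df.copy()
df_nan.loc[0, 'col3'] = np.nan     # one missing value, for illustration only

# 'size' counts rows per group (NaN included) and returns a Series
print(df_nan.groupby(['col1', 'col2']).agg('size'))
# col1  col2
# A     B       4
# C     D       3
# E     F       2
# G     H       1
# dtype: int64

# 'count' counts non-NaN values per column and returns a DataFrame
print(df_nan.groupby(['col1', 'col2']).agg('count'))
#            col3  col4  col5  col6
# col1 col2
# A    B        3     4     4     4
# C    D        3     3     3     3
# E    F        2     2     2     2
# G    H        1     1     1     1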

If you stack() the result of your groupby(), you will have a result that fits the length of your DataFrame's rows. Then, reset_index() will make it fit the previous DataFrame.

df['count'] = df.groupby(['col1', 'col2']).count().stack().reset_index()[0]

Will give you...

import numpy as np
import pandas as pd 

keys = np.array([
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['C', 'D'],
    ['C', 'D'],
    ['C', 'D'],
    ['E', 'F'],
    ['E', 'F'],
    ['G', 'H']
])

df = pd.DataFrame(
    np.hstack([keys, np.random.randn(10, 4).round(2)]),
    columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
)

df[['col3', 'col4', 'col5', 'col6']] = df[['col3', 'col4', 'col5', 'col6']].astype(float)

df_count = df.groupby(['col1', 'col2']).count()
print(df_count)
#            col3  col4  col5  col6
# col1 col2                        
# A    B        4     4     4     4
# C    D        3     3     3     3
# E    F        2     2     2     2
# G    H        1     1     1     1
stacked_count = df_count.stack()
print(stacked_count)
# col1  col2      
# A     B     col3    4
#             col4    4
#             col5    4
#             col6    4
# C     D     col3    3
#             col4    3
#             col5    3
#             col6    3
# E     F     col3    2
#             col4    2
#             col5    2
#             col6    2
# G     H     col3    1
#             col4    1
#             col5    1
#             col6    1
# dtype: int64

ndf = stacked_count.reset_index()[0]
print(ndf)
# 0     4
# 1     4
# 2     4
# 3     4
# 4     3
# 5     3
# 6     3
# 7     3
# 8     2
# 9     2
# 10    2
# 11    2
# 12    1
# 13    1
# 14    1
# 15    1
# Name: 0, dtype: int64

df['count'] = ndf
print(df)
#   col1 col2  col3  col4  col5  col6  count
# 0    A    B -0.81 -1.39  0.07  0.71      4
# 1    A    B -1.37  1.72 -2.04  0.83      4
# 2    A    B -1.81 -0.53 -1.27 -0.83      4
# 3    A    B -1.05 -0.06 -1.78  0.81      4
# 4    C    D -1.40  0.36  0.94  1.90      3
# 5    C    D  0.65  0.49 -1.78  0.10      3
# 6    C    D -0.57 -0.08  1.11 -0.73      3
# 7    E    F  0.37  0.99 -1.62 -1.26      3
# 8    E    F  0.09 -0.03  1.27  0.80      2
# 9    G    H  0.26  0.71  0.05  0.48      2
IMCoins