I have a dataframe (toy example below, taken from another post) that you can generate with the code at the end of this question. I'd like to group by columns 'col1' and 'col2' and count the number of occurrences within each group, as in this example: How to count number of rows per group (and other statistics) in pandas group by?
However, I'd like to add the result directly to my dataframe, as in this example (where there is only one column to group on): Pandas, group by count and add count to original dataframe?
I have tried:
df['count'] = df.groupby(['col1','col2']).transform('count')
And:
df['count'] = df.groupby(['col1','col2'])[['col1','col2']].transform('count')
But I get the same error both times:
ValueError: Length of passed values is 10, index implies 0
Any idea how I could get around this without having to merge the result back into my initial dataframe? In R's dplyr this would be quite easy with group_by, mutate and n().
Toy example:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
Code to generate toy dataframe:
import numpy as np
import pandas as pd
keys = np.array([
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['A', 'B'],
    ['C', 'D'],
    ['C', 'D'],
    ['C', 'D'],
    ['E', 'F'],
    ['E', 'F'],
    ['G', 'H']
])
# np.hstack on string keys and floats yields an all-string array
df = pd.DataFrame(
    np.hstack([keys, np.random.randn(10, 4).round(2)]),
    columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
)
# so the numeric columns have to be cast back to float
num_cols = ['col3', 'col4', 'col5', 'col6']
df[num_cols] = df[num_cols].astype(float)
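One possible way around the error, offered as a sketch rather than a confirmed answer: calling transform on the whole grouped frame appears to return a DataFrame (one column per non-key column), which cannot be assigned to the single column 'count'. Selecting one column before calling transform gives a Series aligned with the original index instead. The example below assumes that any non-key column (here 'col3') can serve as the column to count on, and it uses a reduced, seeded stand-in for the toy dataframe so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for the toy dataframe (assumption: only col1, col2, col3 are needed here)
keys = [['A', 'B']] * 4 + [['C', 'D']] * 3 + [['E', 'F']] * 2 + [['G', 'H']]
df = pd.DataFrame(keys, columns=['col1', 'col2'])
rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df['col3'] = rng.standard_normal(len(df)).round(2)

# Select a single column so that transform returns a Series with df's index;
# 'size' counts all rows per group, whereas 'count' would skip NaN values
df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('size')

print(df)
```

With this, every row carries the size of its (col1, col2) group and no merge is needed. Using 'count' instead of 'size' would give the same numbers here, since 'col3' has no missing values.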