I faced an important issue while using groupby()
with multiple columns of type 'categorical'
. In this scenario, Pandas delivers unexpected rows, especially rows that are not delivered with other types. Below is a basic working example.
df = pd.DataFrame(['a','a','b','c'], columns=['C1'], dtype='category')
df['C2'] = pd.Series(['x','y','z','y']).astype('category')
df['V'] = 0
df
gives a basic DataFrame
:
C1 C2 V
0 a x 0
1 a y 0
2 b z 0
3 c y 0
Now if group this dataframe with multiple columns:
df.groupby(['C1','C2']).sum()
The result contains unexpected rows (combinations of C1
and C2
that don't exist in the input dataframe):
V
C1 C2
a x 0
y 0
z 0
b x 0
y 0
z 0
c x 0
y 0
z 0
If we convert 'categorical'
columns to string
types
df[['C1','C2']] = df[['C1','C2']].astype(str)
df.groupby(['C1','C2']).sum()
The result contains only expected rows:
V
C1 C2
a x 0
y 0
b z 0
c y 0
Is there any other way, than converting categorical columns to string, to overcome this issue?