groupby() with multiple categorical columns delivers unexpected rows

Question

I faced an important issue while using groupby() with multiple columns of type 'categorical'. In this scenario, Pandas delivers unexpected rows, especially rows that are not delivered with other types. Below is a basic working example.

df = pd.DataFrame(['a','a','b','c'], columns=['C1'], dtype='category')
df['C2'] = pd.Series(['x','y','z','y']).astype('category')
df['V'] = 0
df

gives a basic DataFrame:

    C1  C2  V
0   a   x   0
1   a   y   0
2   b   z   0
3   c   y   0

Now if group this dataframe with multiple columns:

df.groupby(['C1','C2']).sum()

The result contains unexpected rows (combinations of C1 and C2 that don't exist in the input dataframe):

If we convert 'categorical' columns to string types

df[['C1','C2']] = df[['C1','C2']].astype(str)
df.groupby(['C1','C2']).sum()

The result contains only expected rows:

Is there any other way, than converting categorical columns to string, to overcome this issue?

score 0 · Answer 1 · answered Feb 18 '23 at 03:17

0

@wjandrea suggested a working solution: an extra parameter observed=True passed to groupby():

df.groupby(['C1','C2'], observed=True).sum()

Result is delivered as expected:

answered Feb 18 '23 at 03:17

Pawel

900
2
10
19

Please don't duplicate existing solutions. There's a banner on top of your question where you can click the option something like "Yes, this answers my question." – wjandrea Feb 18 '23 at 03:18

groupby() with multiple categorical columns delivers unexpected rows

1 Answers1