0

I faced an important issue while using groupby() with multiple columns of type 'categorical'. In this scenario, Pandas delivers unexpected rows, especially rows that are not delivered with other types. Below is a basic working example.

df = pd.DataFrame(['a','a','b','c'], columns=['C1'], dtype='category')
df['C2'] = pd.Series(['x','y','z','y']).astype('category')
df['V'] = 0
df

gives a basic DataFrame:

    C1  C2  V
0   a   x   0
1   a   y   0
2   b   z   0
3   c   y   0

Now if group this dataframe with multiple columns:

df.groupby(['C1','C2']).sum()

The result contains unexpected rows (combinations of C1 and C2 that don't exist in the input dataframe):

        V
C1  C2  
a   x   0
    y   0
    z   0
b   x   0
    y   0
    z   0
c   x   0
    y   0
    z   0

If we convert 'categorical' columns to string types

df[['C1','C2']] = df[['C1','C2']].astype(str)
df.groupby(['C1','C2']).sum()

The result contains only expected rows:

        V
C1  C2  
a   x   0
    y   0
b   z   0
c   y   0

Is there any other way, than converting categorical columns to string, to overcome this issue?

Ken White
  • 123,280
  • 14
  • 225
  • 444
Pawel
  • 900
  • 2
  • 10
  • 19

1 Answers1

0

@wjandrea suggested a working solution: an extra parameter observed=True passed to groupby():

df.groupby(['C1','C2'], observed=True).sum()

Result is delivered as expected:

        V
C1  C2  
a   x   0
    y   0
b   z   0
c   y   0
Pawel
  • 900
  • 2
  • 10
  • 19
  • Please don't duplicate existing solutions. There's a banner on top of your question where you can click the option something like "Yes, this answers my question." – wjandrea Feb 18 '23 at 03:18