Pandas groupby unique issue

Question

I have a dataframe 'region_group'. As shown below, this dataframe has no 'ARTHOG' value in the 'Town/City' column. However, when I do a groupby followed by first() on this column, the value pops back in. I am trying to understand why this happens.

Note: the region_group dataframe is derived from another dataframe that does have 'ARTHOG' in the 'Town/City' column, but that value has been filtered out by the condition shown below, as is also evident in Out[25].

region=k[['my_ID','Town/City','District','County','month','year']]
region=region.loc[(region['month'] == 12) & (region['year'] == 2016)]
region_noid=region.drop(['my_ID','month','year'], axis=1)

region_group=region_noid.groupby(['Town/City','District','County']).size().reset_index(name='Count')

What `dtype` is your "Town/City" column? Just ruling out categoricals. — jpp, Feb 09 '18 at 01:34
region_group.dtypes
Out[29]:
Town/City    category
District     category
County       category
Count           int64
dtype: object
– itthrill, Feb 09 '18 at 01:40

BENY · Answer 1 · 2018-02-09T01:58:59.553

1

A categorical column carries all of its categories along with it. When a category has no rows in the data, groupby still keeps it in the result and fills its value with NaN.

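import pandas as pd
df=pd.DataFrame({'A':[1,1,3,4,5],'B':[1,2,2,2,2]})
# astype('category', categories=...) is no longer accepted by pandas; build the categorical explicitly
df.A=pd.Categorical(df.A,categories=[1,2,3,4,5])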

df.groupby('A').B.first()
Out[905]: 
A
1    1.0
2    NaN
3    2.0
4    2.0
5    2.0
Name: B, dtype: float64

Solution: convert the column back to str or a numeric type

df.A=df.A.astype(int)
df.groupby('A').B.first()
Out[907]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64

Or, while the column is still categorical (that is, without the int conversion above), use remove_unused_categories

df.A=df.A.cat.remove_unused_categories()
df.groupby('A').B.first()
Out[910]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64

edited Feb 09 '18 at 01:58

answered Feb 09 '18 at 01:45


wen and jp_data_analysis, thanks to both of you. It's great to know this. You are both right, but I can accept only one answer. I have accepted jp_data_analysis's answer because he answered first. Thanks again. – itthrill Feb 09 '18 at 01:48
@MadhukarJha That's OK. Let me offer you some more options. – BENY Feb 09 '18 at 01:57

score 0 · Accepted Answer · answered Feb 09 '18 at 01:43

Pandas uses the Cartesian product of the categories of all categorical grouping columns to determine the index of a groupby result. This means that even if a category (or a combination of categories) is not represented in the underlying data, it will still appear in the groupby results.

Details of this, as well as possible solutions, can be found in my question challenging the purpose of this behaviour: Pandas groupby with categories

The pandas development team have taken the stance that all combinations of categories must be represented in groupby operations on categorical series.
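To make the Cartesian-product behaviour concrete, here is a minimal sketch on made-up data. The column names `town` and `county` and their values are hypothetical, not taken from the question. `observed=False` is passed explicitly, which assumes pandas 0.23 or later, so the result does not depend on the version's default.

```python
import pandas as pd

# Hypothetical data: "town" declares three categories but only two appear in the rows.
df = pd.DataFrame({
    "town": pd.Categorical(["A", "B"], categories=["A", "B", "C"]),
    "county": pd.Categorical(["X", "Y"], categories=["X", "Y"]),
})

# observed=False asks for every combination of categories, seen or not.
counts = df.groupby(["town", "county"], observed=False).size()

# The index has 3 * 2 = 6 entries even though the frame has only 2 rows.
print(len(counts))
print(counts.index.tolist())
```

Only two of the six index entries correspond to real rows; the rest exist purely because the categories were declared.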

score 0 · Answer 3 · answered May 29 '18 at 08:27

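Since pandas 0.23.0, the groupby method accepts an "observed" parameter. Setting observed=True (it was False by default when this was written) restricts the result to the categories that actually appear in the data, which fixes this issue.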
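As a rough illustration, the sketch below reuses the small frame from the first answer and compares the two settings side by side. It assumes pandas 0.23 or later. Passing `observed` explicitly avoids relying on the default, which has not been the same in every pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3, 4, 5], "B": [1, 2, 2, 2, 2]})
# Category 2 is declared but never occurs in the data.
df["A"] = pd.Categorical(df["A"], categories=[1, 2, 3, 4, 5])

# observed=False keeps every declared category; the unused one gets NaN.
all_cats = df.groupby("A", observed=False)["B"].first()

# observed=True keeps only the categories that actually appear.
seen_only = df.groupby("A", observed=True)["B"].first()

print(list(all_cats.index))   # [1, 2, 3, 4, 5]
print(list(seen_only.index))  # [1, 3, 4, 5]
```

Unlike converting the column to int or str, this leaves the column's dtype and its declared categories untouched.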

Ismael EL ATIFI

