
I have a dataframe 'region_group'. As shown below, it has no 'ARTHOG' value in the 'Town/City' column. However, when I do a groupby followed by first() on this column, the value reappears. I am trying to understand why this happens.

Note: region_group is derived from another dataframe that does have 'ARTHOG' in its 'Town/City' column, but those rows were filtered out with the condition shown below, as is also evident in Out[25].

region=k[['my_ID','Town/City','District','County','month','year']]
region=region.loc[(region['month'] == 12) & (region['year'] == 2016)]
region_noid=region.drop(['my_ID','month','year'], axis=1)

region_group=region_noid.groupby(['Town/City','District','County']).size().reset_index(name='Count')

enter image description here

itthrill

3 Answers

Categorical data carries its categories along: when a category has no values, it is still kept in the groupby result and its value is filled with NaN.

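import pandas as pd

df=pd.DataFrame({'A':[1,1,3,4,5],'B':[1,2,2,2,2]})
# astype('category', categories=...) was removed from pandas; pass a CategoricalDtype instead
df.A=df.A.astype(pd.CategoricalDtype(categories=[1,2,3,4,5]))

# on recent pandas, pass observed=False to groupby to reproduce the NaN row below
df.groupby('A').B.first()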
Out[905]: 
A
1    1.0
2    NaN
3    2.0
4    2.0
5    2.0
Name: B, dtype: float64

Solution: convert the column back to str or a numeric dtype.

df.A=df.A.astype(int)
df.groupby('A').B.first()
Out[907]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64

Or use remove_unused_categories:

df.A=df.A.cat.remove_unused_categories()
df.groupby('A').B.first()
Out[910]: 
A
1    1
3    2
4    2
5    2
Name: B, dtype: int64
BENY
  • wen and jp_data_analysis, thanks to both of you. It's great to know this. You are both right, but I can accept only one answer. I have accepted jp_data_analysis's answer because he answered first. Thanks again. – itthrill Feb 09 '18 at 01:48
  • @MadhukarJha That is OK. Let me offer you more options. – BENY Feb 09 '18 at 01:57

Pandas uses the Cartesian product of the categories of all categorical grouping columns to build the index of a groupby result. This means that a category, or a combination of categories, appears in the output even if it has no rows in the underlying data.
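A minimal sketch of that behaviour, using made-up data rather than the asker's dataframe: the town and county names and the `demo` frame are assumptions for illustration only. 'ARTHOG' is declared as a category but has no rows, and `observed=False` is passed explicitly so the result does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical data: 'ARTHOG' and 'POWYS' are declared categories with no rows.
towns = pd.Categorical(["BARMOUTH", "BALA"],
                       categories=["ARTHOG", "BALA", "BARMOUTH"])
counties = pd.Categorical(["GWYNEDD", "GWYNEDD"],
                          categories=["GWYNEDD", "POWYS"])
demo = pd.DataFrame({"Town/City": towns, "County": counties})

# observed=False requests every combination of categories, seen or not.
counts = (
    demo.groupby(["Town/City", "County"], observed=False)
        .size()
        .reset_index(name="Count")
)
print(counts)
```

The result should have one row per town/county combination (3 x 2), with a Count of 0 for combinations that never occur, which is how a filtered-out town can reappear after a groupby.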

Details of this, as well as possible solutions, can be found in my question challenging the purpose of this behaviour: Pandas groupby with categories

The pandas development team has taken the stance that all combinations of categories must be represented in groupby operations on categorical series.

jpp

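Since pandas 0.23.0, groupby accepts an "observed" parameter. Setting observed=True fixes this issue by keeping only the categories that actually appear in the data. The default was False when the parameter was introduced; that default has been deprecated since pandas 2.1, so it is safest to pass the argument explicitly.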
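As a rough illustration, the small frame from the earlier answer can be rebuilt and grouped both ways. The variable names `all_cats` and `seen_only` are made up for this sketch.

```python
import pandas as pd

# Category 2 is declared but has no rows.
df = pd.DataFrame({"A": [1, 1, 3, 4, 5], "B": [1, 2, 2, 2, 2]})
df["A"] = df["A"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))

# observed=False keeps every declared category, filling gaps with NaN.
all_cats = df.groupby("A", observed=False)["B"].first()

# observed=True keeps only the categories present in the data.
seen_only = df.groupby("A", observed=True)["B"].first()

print(list(all_cats.index))   # [1, 2, 3, 4, 5]
print(list(seen_only.index))  # [1, 3, 4, 5]
```

Unlike converting the column back to int or calling remove_unused_categories, this leaves the column's dtype and its declared categories untouched.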

Ismael EL ATIFI