2

New to pandas I want to perform something similar to Reduce number of levels for large categorical variables (binning of categorical variables in order to reduce their levels) The following code works fine in R

DTsetlvls <- function(x, newl)  
   setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

My dataframe:

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

df['Counts'] = df.groupby('Color')['Value'].transform('count')
print (df)

  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

I manually would create an aggregate column and then based on that, label the less frequent groups e.g. "blue" as a single "other" group. But compared to the concise R code this seems clumsy. What would be the right approach here?

smci
  • 32,567
  • 20
  • 113
  • 146
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
  • Possible duplicate of [How to group "remaining" results beyond Top N into "Others" with pandas](http://stackoverflow.com/questions/19835746/how-to-group-remaining-results-beyond-top-n-into-others-with-pandas) – Ami Tavory Aug 23 '16 at 11:27

1 Answers1

6

You can use value_counts with numpy.where, where is condition with isin.

If your variable is of type object see below. If your variable is of type category, then skip down toward the bottom.

df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(), 
                   'Value':[11,150,50,30,10,40]})
print (df)
    Color  Value
0     Red     11
1     Red    150
2    Blue     50
3     Red     30
4  Violet     10
5    Blue     40

a = df.Color.value_counts()
print (a)
Red       3
Blue      2
Violet    1
Name: Color, dtype: int64

#get top 2 values of index
vals = a[:2].index
print (vals)
Index(['Red', 'Blue'], dtype='object')

df['new'] = np.where(df.Color.isin(vals), 0,1)
print (df)
    Color  Value  new
0     Red     11    0
1     Red    150    0
2    Blue     50    0
3     Red     30    0
4  Violet     10    1
5    Blue     40    0

Or if need replace all not top values use where:

df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print (df)
    Color  Value   new1
0     Red     11    Red
1     Red    150    Red
2    Blue     50   Blue
3     Red     30    Red
4  Violet     10  other
5    Blue     40   Blue

For category type:

df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(), 
                   'Value':[11,150,50,30,10,40]})
df.Color = df.Color.astype('category')

a= df.Color.value_counts()[:2].index
print(a)
CategoricalIndex(['Red', 'Blue'], 
                categories=['Blue', 'Red', 'Violet'], 
                ordered=False, dtype='category')

Notice that violet is still a category. So we need .remove_unused_categories().

vals = df.Color.value_counts()[:2].index.remove_unused_categories()
CategoricalIndex(['Red', 'Blue'], 
                 categories=['Blue', 'Red'], 
                 ordered=False, dtype='category')

As mentioned in the comments, a ValueError will occur when setting the new variable. The way around that is type changing.

df['new1'] = df.Color.astype('object').where(df.Color.isin(vals), 'other')
df['new1'] = df['new1'].astype('category')
Climbs_lika_Spyder
  • 6,004
  • 3
  • 39
  • 53
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I get this error `ValueError: Cannot setitem on a Categorical with a new category, set the categories first`. This is due to my variable being categorical already. My fix was adding `.astype('object')` like this `df.Color.astype('object').where(df.Color.isin(vals), 'other')`. – Climbs_lika_Spyder May 20 '20 at 16:22