How can I tell CatBoost to group together categorical values that have few samples? For example, let's say I have a column called Country which has only 1 sample for 'Cambodia', 2 samples for 'Mongolia', and 999,998 other countries, each of which has at least 100 samples. I would like to tell CatBoost not to bother doing its CTR magic on those rare countries and to just treat them as "other".

Hanan Shteingart

1 Answer

Assuming you have a pandas DataFrame and train/test sets you wish to transform, the small code snippet below will transform your low-count values into 'other'. I put a threshold of 100, but you can change it to whatever you need!

Basically the code gets the list of values that have a low count and replaces them with the desired value.

Note: you could run .value_counts() on your column to see what is there before transforming a categorical column.
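
As a quick illustration of that check, here is a minimal sketch on a made-up Country column. The data and the threshold of 3 are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame(
    {'Country': ['France'] * 5 + ['Japan'] * 4 + ['Mongolia'] * 2 + ['Cambodia']}
)

# Number of rows per category, most frequent first.
counts = df['Country'].value_counts()
print(counts)

# Categories that would be collapsed with a threshold of 3.
rare = sorted(counts[counts < 3].index)
print(rare)
```

Looking at the counts first makes it easier to pick a threshold that does not collapse more categories than intended.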

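import pandas as pd

def transform_lowcount_cat(train, colstoreplace, test=None, replaceval='other', threshold=100):
    for col in colstoreplace:
        # Count categories on the TRAIN set only; the test set never influences the mapping.
        counts = train[col].value_counts()
        low_val_cat = counts[counts < threshold].index
        # Assign back to the column instead of relying on inplace=True,
        # which may not update the parent dataframe in recent pandas versions.
        train[col] = train[col].where(~train[col].isin(low_val_cat), replaceval)
        print(col + ' - TRAIN set transformed')
        # 'test == None' is ambiguous for a DataFrame, so use an identity check.
        if test is None:
            print('TEST set NOT transformed')
        else:
            test[col] = test[col].where(~test[col].isin(low_val_cat), replaceval)
            print(col + ' - TEST set transformed')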

Then create a list of the column or columns you want to transform and run the code with your desired replacement value and threshold. Note that this modifies the dataframes in place.

colstoreplace = ['Col1','Col2']
transform_lowcount_cat(train=train, test=test, colstoreplace=colstoreplace, replaceval='whatever you want!', threshold = 100)
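
One thing to watch: the function only replaces values that are rare in the training set, so a category that appears in test but never in train is left untouched. A possible variation, sketched below on the assumption that you want those unseen values mapped to the replacement value as well, keeps the set of frequent values and replaces everything else. The helper name collapse_rare and the toy data are hypothetical, not part of pandas or CatBoost.

```python
import pandas as pd

def collapse_rare(train_col, other_col, replaceval='other', threshold=100):
    """Return copies of both columns with non-frequent values replaced."""
    counts = train_col.value_counts()
    frequent = counts[counts >= threshold].index

    def fix(s):
        # Keep values that are frequent in train, and keep missing values as they are;
        # everything else (rare or unseen) becomes replaceval.
        return s.where(s.isin(frequent) | s.isna(), replaceval)

    return fix(train_col), fix(other_col)

# Toy data, invented for illustration.
train = pd.DataFrame({'Country': ['France'] * 3 + ['Japan'] * 3 + ['Mongolia']})
test = pd.DataFrame({'Country': ['France', 'Mongolia', 'Peru']})

train['Country'], test['Country'] = collapse_rare(
    train['Country'], test['Country'], threshold=2
)
print(test['Country'].tolist())  # ['France', 'other', 'other']
```

Here 'Mongolia' is rare in train and 'Peru' never occurs in train, so both end up as 'other' in the test set.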