How can I tell CatBoost to group together categorical values that have few samples? For example, let's say I have a column called Country which has only 1 sample for 'Cambodia' and 2 samples for 'Mongolia', plus 999,998 other countries, each of which has at least 100 samples. I would like to tell CatBoost not to bother doing its CTR magic on those rare countries but just treat them as "other".
1 Answer
Assuming you have a pandas dataframe and train/test sets you wish to transform, the small code snippet below will replace your low-count values with 'other'. I put a threshold of 100, but you can change it to whatever you need!
Basically, the code gets the list of values that have a low count in the train set and replaces them with the desired value.
Note: you could run .value_counts() on your column to see what's there before transforming a category column.
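For that inspection step, here is a quick sketch of what it might look like. The toy dataframe and the cutoff of 3 are made up purely for illustration; with real data you would keep your own column and threshold.

```python
import pandas as pd

# Toy data (made up for illustration): one common country, two rare ones
df = pd.DataFrame({"Country": ["France"] * 5 + ["Mongolia"] * 2 + ["Cambodia"]})

counts = df["Country"].value_counts()   # Series: category -> number of rows
threshold = 3                           # illustrative cutoff, not a recommendation
rare = counts[counts < threshold]       # categories that fall below the cutoff

print(counts)
print(sorted(rare.index))
```

Anything that shows up in `rare` is what a threshold-based replacement would fold into the catch-all value.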
    def transform_lowcount_cat(train, colstoreplace, test=None, replaceval='other', threshold=100):
        """Replace values seen fewer than `threshold` times in train with `replaceval` (in place)."""
        for col in colstoreplace:
            # Count each value on the train set only
            counts = train[col].value_counts()
            low_val_cat = counts[counts < threshold].index
            # Overwrite the low-count values directly in the dataframe
            train.loc[train[col].isin(low_val_cat), col] = replaceval
            print(col + ' - TRAIN set transformed')
            if test is None:
                print('TEST set NOT transformed')
            else:
                # Reuse the low-count list learned from train on the test set
                test.loc[test[col].isin(low_val_cat), col] = replaceval
                print(col + ' - TEST set transformed')
Then create a list of the column(s) you want to transform and run the code with your desired replacement value and threshold. Note that this transforms the dataframes in place.
    colstoreplace = ['Col1', 'Col2']
    transform_lowcount_cat(train=train, test=test, colstoreplace=colstoreplace, replaceval='whatever you want!', threshold=100)
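To see the effect end to end, here is a self-contained sketch of the same idea on toy data. The helper name `collapse_rare`, the toy frames, and the threshold of 3 are all hypothetical and only for illustration. Unlike the in-place version, this variant returns copies so the originals stay untouched.

```python
import pandas as pd

def collapse_rare(train, test, col, replaceval="other", threshold=100):
    """Hypothetical helper: return copies of train/test with rare values of `col` replaced."""
    counts = train[col].value_counts()          # counts come from train only
    rare = counts[counts < threshold].index     # values below the cutoff
    train_out, test_out = train.copy(), test.copy()
    train_out.loc[train_out[col].isin(rare), col] = replaceval
    test_out.loc[test_out[col].isin(rare), col] = replaceval
    return train_out, test_out

# Toy data (made up): 'Mongolia' and 'Cambodia' are rare in train
train = pd.DataFrame({"Country": ["France"] * 4 + ["Japan"] * 3 + ["Mongolia"] * 2 + ["Cambodia"]})
test = pd.DataFrame({"Country": ["France", "Mongolia", "Cambodia", "Peru"]})

new_train, new_test = collapse_rare(train, test, "Country", threshold=3)
print(new_train["Country"].value_counts().to_dict())
print(new_test["Country"].tolist())
```

One thing to keep in mind: a value that appears in test but never in train ('Peru' here) is not in the low-count list, so it is left as is. If you want those mapped to the replacement value too, you would need an extra step.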

Vishal Bajaj