Recently I was working on the Kaggle competition "Prudential Life Insurance Assessment", where competitors discuss changing the labels to get a better metric score.
In that competition the target has 8 classes (1-8), but one competitor used a different set of label values,
[-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0]
instead of
[1, 2, 3, 4, 5, 6, 7, 8].
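If I understand the trick correctly, the new values are simply substituted for the original classes before training. A toy illustration of the substitution itself (my own, using the values quoted above):

import numpy as np

magic = np.array([-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0])

y = np.array([3, 1, 8, 5])    # some original labels (1-8)
y_new = magic[y - 1]          # -> [0.3, -1.6, 9.0, 4.53]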
I was wondering: how does one come up with these magic numbers?
Any ideas/tricks/suggestions for finding such transformations are highly appreciated!
Example Code
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split
# data
df = sns.load_dataset('iris')
df['species'] = pd.factorize(df['species'])[0]
df = df.sample(frac=1, random_state=100)
# train test split
X = df.drop('species',axis=1)
y = df['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, random_state=100)
# modelling
model = xgb.XGBClassifier(objective='multi:softprob', random_state=100)
model.fit(Xtrain, ytrain)
preds = model.predict(Xtest)
kappa = metrics.cohen_kappa_score(ytest, preds, weights='quadratic')
print(kappa)
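For context, my understanding (an assumption on my part, not something confirmed in the competition threads) is that the transformed labels would be used as regression targets, with the continuous predictions snapped back to the original classes before scoring. A minimal sketch on the same iris setup, where the three "magic" values are made up by me:

import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')
df['species'] = pd.factorize(df['species'])[0]
X = df.drop('species', axis=1)
y = df['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, random_state=100)

# hypothetical "magic" values for the 3 iris classes (0, 1, 2)
magic = np.array([-0.5, 1.2, 2.4])

# regress on the transformed targets instead of classifying
reg = xgb.XGBRegressor(random_state=100)
reg.fit(Xtrain, magic[ytrain.to_numpy()])

# snap each continuous prediction back to the nearest magic value,
# which identifies one of the original classes
cont = reg.predict(Xtest)
preds = np.abs(cont[:, None] - magic[None, :]).argmin(axis=1)
print(metrics.cohen_kappa_score(ytest, preds, weights='quadratic'))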
My thoughts
There are infinitely many values the labels could take, so how do we transform [1-8] into some [x-y]? Should we just randomly choose 8 numbers and check the kappa for each candidate set (a sketch of what I mean is below)? That seems like the most irrational approach and would probably never work.
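For concreteness, here is the random-search idea on the iris setup above (3 classes instead of 8; kappa_for_labels is my own helper name, and retraining the model for every candidate is obviously expensive):

import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')
df['species'] = pd.factorize(df['species'])[0]
X = df.drop('species', axis=1)
y = df['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, random_state=100)

def kappa_for_labels(values):
    # train on the candidate label values, score QWK after snapping back
    values = np.sort(values)                  # keep the ordinal structure
    reg = xgb.XGBRegressor(random_state=100)
    reg.fit(Xtrain, values[ytrain.to_numpy()])
    cont = reg.predict(Xtest)
    preds = np.abs(cont[:, None] - values[None, :]).argmin(axis=1)
    return metrics.cohen_kappa_score(ytest, preds, weights='quadratic')

rng = np.random.default_rng(100)
best_vals, best_kappa = None, -np.inf
for _ in range(50):                           # 50 random candidate sets
    candidate = rng.uniform(-2, 4, size=3)    # 3 values for iris's 3 classes
    k = kappa_for_labels(candidate)
    if k > best_kappa:
        best_vals, best_kappa = np.sort(candidate), k
print(best_kappa, best_vals)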
Is there some kind of gradient-descent method for this? Maybe not; just an idea.
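One caveat with literal gradient descent: once predictions are snapped back to classes, kappa is piecewise constant in the label values, so the gradient is zero almost everywhere. A derivative-free optimizer seems like the natural stand-in. Below is a sketch with scipy's Nelder-Mead, reusing the kappa_for_labels helper from the random-search sketch above (again my own guess, not a known recipe from the competition):

from scipy.optimize import minimize

# minimize the negative kappa; start from the plain labels (0, 1, 2)
res = minimize(lambda v: -kappa_for_labels(v),
               x0=np.array([0.0, 1.0, 2.0]),
               method='Nelder-Mead')
print(-res.fun, np.sort(res.x))

Since the helper retrains xgboost on every evaluation, it might make more sense in practice to train the model once and only optimize the values (or cutoff thresholds) used for the snap-back, so that each objective evaluation is cheap.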