
Recently I was working on the Kaggle competition "Prudential Life Insurance Assessment", where competitors discuss changing the labels in order to get a better metric score.

In that particular competition the target has 8 classes (1-8), but one competitor uses different labels (-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0), and another solution uses the same values [-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0] instead of [1, 2, 3, 4, 5, 6, 7, 8].

I was wondering: how does one come up with these magic numbers?

Any ideas/tricks/suggestions for doing such a transformation are highly appreciated!

Example Code

# imports
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

# data
df = sns.load_dataset('iris')
df['species'] = pd.factorize(df['species'])[0]
df = df.sample(frac=1,random_state=100)

# train test split
X = df.drop('species',axis=1)
y = df['species']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, random_state=100)

# modelling
model = xgb.XGBClassifier(objective='multi:softprob', random_state=100)
model.fit(Xtrain, ytrain)
preds = model.predict(Xtest)
kappa = metrics.cohen_kappa_score(ytest, preds, weights='quadratic')

print(kappa)

My thoughts

  • There are literally infinitely many values the labels can take; how do we transform [1-8] to [x-y]?

  • Should we just randomly choose 8 numbers and check kappa for each candidate set? That seems like the most irrational approach and will probably never work.

  • Is there some kind of gradient-descent method for this? Maybe not, just an idea.

Reference Links

BhishanPoudel

1 Answer


The very first link in your question actually contains the answer:

#The hardcoded values were obtained by optimizing a CV score using simulated annealing

Later, the author also comments:

At first I was optimising the parameters one by one but then I switched to optimising them simultaneously by a combination of grid search and simulated annealing. I am not sure I found a global maximum of the CV score though, even after playing around with various settings of the simulated annealing. Maybe genetic algorithms would help.

The second link's solution has the same values because the author (most likely) copied them from the first solution (see their comments):

Inspired by: https://www.kaggle.com/mariopasquato/prudential-life-insurance-assessment/linear-model/code

To put it simply, you can treat these values as metaparameters of your learning algorithm (which, in effect, they are). You can then define a function F(metaparameters) such that computing a single value of F means doing a full training run on your training set and returning the loss on a validation set (or better, use n-fold cross-validation and return the CV loss). Your task then becomes optimizing F to find the best set of metaparameters, using whatever optimization method you like; for example, the first solution's author says they used a combination of grid search and simulated annealing.
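For concreteness, here is one possible way such an F could look for the iris example from your question. This is only a sketch under assumptions of mine (train a regressor on the remapped targets, snap each prediction back to the nearest candidate value, score with quadratic kappa); the helper name make_cv_objective is made up for illustration and is not from either Kaggle solution:

import numpy as np
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import KFold

def make_cv_objective(X, y, n_splits=5):
    """Return F(candidate_labels): negative mean CV quadratic kappa.

    `y` holds the original integer classes (0..n_classes-1); `candidate_labels`
    is the vector of real values we are trying to optimize, one per class.
    """
    X, y = np.asarray(X), np.asarray(y)

    def F(candidate_labels):
        candidate_labels = np.asarray(candidate_labels)
        kappas = []
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=100)
        for train_idx, valid_idx in kf.split(X):
            # remap the integer classes to the candidate real values
            y_train = candidate_labels[y[train_idx]]
            model = xgb.XGBRegressor(random_state=100)
            model.fit(X[train_idx], y_train)
            preds = model.predict(X[valid_idx])
            # snap each continuous prediction to the index of the nearest candidate value
            pred_classes = np.abs(preds[:, None] - candidate_labels[None, :]).argmin(axis=1)
            kappas.append(metrics.cohen_kappa_score(y[valid_idx], pred_classes,
                                                    weights='quadratic'))
        return -np.mean(kappas)  # minimizing F maximizes the mean CV kappa

    return F

You would then pass make_cv_objective(X, y) to an optimizer such as scipy.optimize.basinhopping in place of the toy function used below.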

Small example with no meta-tuning for the optimization itself:

import numpy as np
import scipy.optimize
cnt = 0
def use_a_function_which_calls_training_and_computes_cv_instead_of_this(x):
    global cnt
    cnt += 1
    return ((x - np.array([-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0]))**2).sum()

my_best_guess_for_the_initial_parameters = np.array([1.,2.,3.,4.,5.,6.,7.,8.])
optimization_results = scipy.optimize.basinhopping(
    use_a_function_which_calls_training_and_computes_cv_instead_of_this,
    my_best_guess_for_the_initial_parameters,
    niter=100)
print("Times function was called: {0}".format(cnt))
print(optimization_results.x)

Example output:

Times function was called: 3080
[-1.6         0.7         0.3         3.15        4.52999999  6.5
  6.77        8.99999999]

You will quite possibly want to experiment with the parameters of the optimization itself, maybe even write a custom optimizer and/or a callback for making steps. It is also possible that even the default parameters will work for you, at least to some degree. If a single evaluation of the function takes too long, you can, for example, run an initial optimization on a smaller subset of your full data, etc.
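As one possible starting point for such experimentation (the specific settings below are only illustrative assumptions, not recommendations), basinhopping lets you control the random step size, the acceptance "temperature", the local minimizer it calls, and a per-step callback:

import numpy as np
import scipy.optimize

def objective(x):
    # placeholder - substitute your CV-based objective (e.g. negative CV kappa) here
    return ((x - np.array([-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0]))**2).sum()

def progress_callback(x, f, accepted):
    # called after every basin-hopping step; useful for logging or early stopping
    print("labels: {0}, f = {1:.4f}, accepted = {2}".format(np.round(x, 3), f, accepted))

optimization_results = scipy.optimize.basinhopping(
    objective,
    np.array([1., 2., 3., 4., 5., 6., 7., 8.]),   # start from the original labels
    niter=50,                                     # fewer hops if each evaluation is expensive
    stepsize=1.0,                                 # size of the random displacement between hops
    T=1.0,                                        # "temperature" for accepting worse hops
    minimizer_kwargs={'method': 'Nelder-Mead'},   # gradient-free local minimizer
    callback=progress_callback)
print(optimization_results.x)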

Alexander Pivovarov
  • Obviously I had read all the comments and Kaggle discussion about the topic from both authors. I was just curious how you perform this grid search and simulated annealing. If you accompany the answer with some `code`, I will happily accept your answer. – BhishanPoudel Jun 25 '20 at 15:27
  • Just to clarify - the example code you're requesting is just for e.g. simulated annealing itself, not for how to compute loss using cross-validation, etc.? – Alexander Pivovarov Jun 25 '20 at 15:42
  • Yeah man, I appreciate your effort; simulated annealing is just one of the methods of grid search. I was more interested in how to choose initial values, such as `[1,2,3]`, and get more rational values. There are literally infinite choices for grid search. For example, for the linear-regression alpha we can choose a log scale `1e-5, 1e-4, 1, 10` etc., but here it's the complete range of real numbers. Do you have any suggestions on how to start from an initial guess, let's say `[1,2,3]`, and then choose rational hyperparameters? – BhishanPoudel Jun 25 '20 at 16:01
  • I don't want to disappoint you, but there is no magic pill here. If you know that your final result is supposed to be used in a way where the absolute values on their own make sense, you can just start with [1,2,3,4,5,6,7,8]. If you know that these are e.g. probabilities, you might want to choose some uniformly spaced interval in logit space; if these are log-scale, then you can choose uniformly spaced values and apply exp to them (and optimize in the corresponding space - i.e. uniform, logit or exponential-or-log, whatever you might want to call it). – Alexander Pivovarov Jun 25 '20 at 16:22
  • I can absolutely add an example of how to use `scipy.optimize.basinhopping` (which is replacing `anneal` in the latest scipy release) to my answer on an artificial example, but not sure if that is what you want. – Alexander Pivovarov Jun 25 '20 at 16:24
  • I would appreciate your input for scipy optimize. – BhishanPoudel Jun 25 '20 at 16:27
  • 1
    @astro123 , updated my answer with a toy example of optimization using scipy. – Alexander Pivovarov Jun 25 '20 at 18:15