
I'm currently trying to understand certain high-level classification problems and have come across some code from a Kaggle competition that ran in 2012. The competition discussion board is (here) and the winning code is (here). Near the end of the code, at line 223, the predicted values in a list of two arrays are multiplied by 0.4 and 0.6 respectively and then added together: final_pred = preds[0] * 0.4 + preds[1] * 0.6. My question is: why are the values multiplied before being returned as an array to the calling function? After the array is returned, its values are saved to a CSV, so no further "processing" takes place. The models used are LogisticRegression and svm.SVC, but this happens after all the models have finished their business with the data and after predictions have been made with pred = model.predict_proba(X_test).

Can anyone please give me some information as to why this happens?

EDIT: adding the function's code for completeness' sake. This code is part of a longer program that classifies text (binary [0, 1]) as either an insult or a non-insult. The links to the original code are included in my original post.

# time, array (from numpy), linear_model and svm are imported
# earlier in the original program
def runClassifiers(X_train, y_train, X_test, y_test=None, verbose=True):

    models = [linear_model.LogisticRegression(C=3),
              svm.SVC(C=0.3, kernel='linear', probability=True)]
    # another two classifiers are commented out by the original author,
    # which is why dense below still has four entries

    dense = [False, False, True, True]    # True if the model needs a dense matrix

    X_train_dense = X_train.todense()
    X_test_dense  = X_test.todense()

    preds = []
    for ndx, model in enumerate(models):
        t0 = time()
        print "Training: ", model, 20 * '_'
        if dense[ndx]:
            model.fit(X_train_dense, y_train)
            pred = model.predict_proba(X_test_dense)
        else:
            model.fit(X_train, y_train)
            pred = model.predict_proba(X_test)
        print "Training time: %0.3fs" % (time() - t0)
        preds.append(array(pred[:, 1]))   # keep only the positive-class column

    final_pred = preds[0] * 0.4 + preds[1] * 0.6
    return final_pred
salvu
  • I haven't looked at the entire code. But from your post, the preds list holds the probabilities of the event occurring under each model. He then uses a weighted average to combine both methods (putting more weight on preds[1]), which returns the final weighted probabilities. – Vico Aug 25 '17 at 13:29
  • @Vico: I understand that he put more weight on the SVM, as in the competition's discussion board he said that the SVM classifier could have won the competition with the use of LR. But I'm curious as to how one decides the weight values; for example, why not 0.3 and 0.7, or would that be giving too much preference to the second classifier, creating a bias? Thanks. – salvu Aug 26 '17 at 07:03

1 Answer


This is just a meta-predictor using two sub-predictors (LogReg and SVM).

There are many approaches for combining multiple prediction models, and this convex combination is one of the simplest.
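As a minimal sketch (with made-up probabilities, not values from the competition), the convex combination looks like:

```python
import numpy as np

# Hypothetical per-sample probabilities of the positive class
# from the two sub-predictors (LogReg and SVM).
p_logreg = np.array([0.2, 0.9, 0.6])
p_svm    = np.array([0.3, 0.8, 0.7])

w = 0.6  # weight given to the second model (the SVM here)
final = (1 - w) * p_logreg + w * p_svm
# The weights are non-negative and sum to 1, so each combined
# value is still a valid probability between 0 and 1.
```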

The weights were probably also tuned with some cross-validation approach, leading to these numbers, where the SVM classifier is taken more seriously!
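One common way to pick such weights is a small grid search on a held-out validation fold; here is a sketch of that idea (the helper best_weight and the metric are assumptions for illustration, not from the original code):

```python
import numpy as np

def best_weight(p1, p2, y_val, metric):
    # p1, p2: positive-class probabilities from the two models on a
    # held-out validation fold; metric(y_true, p): higher is better.
    ws = np.linspace(0.0, 1.0, 101)
    scores = [metric(y_val, (1 - w) * p1 + w * p2) for w in ws]
    return ws[int(np.argmax(scores))]

# Example metric: negative Brier score (higher = better calibrated).
neg_brier = lambda y, p: -np.mean((y - p) ** 2)
```

The chosen weight is then reused unchanged when predicting on the test set, exactly as the 0.4/0.6 pair is hard-coded in the Kaggle script.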

I'm not sure what exactly the task is, but I think the number of classes should be 2 (0 and 1, or -1 and 1; at least in this prediction step; there might be some outer OvO or OvA scheme) for this to make sense.
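For a binary task, predict_proba returns one row per sample and one column per class, which is why the code slices pred[:, 1]. A minimal sketch with a hypothetical probability matrix:

```python
import numpy as np

# Shaped like predict_proba output for a binary problem:
# (n_samples, 2), with each row summing to 1.
proba = np.array([[0.8, 0.2],
                  [0.1, 0.9],
                  [0.4, 0.6]])   # made-up values

# Column 0 is P(class 0), column 1 is P(class 1);
# preds.append(array(pred[:, 1])) keeps only the latter.
p_pos = proba[:, 1]
```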

sascha
  • You are right, the program predicts values for a binary problem of [0,1]. I have included the function where the classifiers and predictions take place. Thank you for your reply. – salvu Aug 25 '17 at 16:15