I'm trying to understand some high-level classification problems and came across code from a Kaggle competition that ran in 2012. The competition discussion board is (here) and the winning code is (here). Near the end of the code, at line 223, the predicted values, which are held in a list of two arrays, are multiplied by 0.4 and 0.6 respectively and then added together. This is the line: final_pred = preds[0] * 0.4 + preds[1] * 0.6
My question is: why are the values multiplied before being returned as an array to the calling function? After the array is returned, its values are saved to a CSV file, so no further processing is done. The models used are LogisticRegression and svm.SVC, but the multiplication happens after both models have been fitted and after the predictions have been made with pred = model.predict_proba(X_test)
Can anyone please give me some information as to why this happens?
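To make concrete what that line computes, here is a minimal sketch with made-up probabilities. The numbers and the plain-list representation are my own assumptions for illustration, not values from the competition code. Because the weights 0.4 and 0.6 sum to 1, each result is a weighted average that lies between the two input values.

```python
# Hypothetical positive-class probabilities from two models for three samples.
# These numbers are made up for illustration; they are not from the competition.
preds = [
    [0.90, 0.20, 0.55],  # stand-in for the logistic regression output
    [0.70, 0.10, 0.65],  # stand-in for the SVM output
]
weights = [0.4, 0.6]  # the weights used in the original line; they sum to 1

# Element-wise weighted average. With NumPy arrays this is what
# preds[0] * 0.4 + preds[1] * 0.6 computes in a single expression.
final_pred = [
    weights[0] * a + weights[1] * b
    for a, b in zip(preds[0], preds[1])
]

for a, b, blended in zip(preds[0], preds[1], final_pred):
    print("%.2f and %.2f -> %.2f" % (a, b, blended))
```

The second model's value counts for more in every blended result, since its weight is 0.6 versus 0.4.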
EDIT: Adding the function's code for completeness' sake. This code is part of a longer program that classifies text as either an insult or a non-insult (binary labels 0 and 1). The links to the original code are included above.
    def runClassifiers(X_train, y_train, X_test, y_test = None, verbose = True):
        models = [ linear_model.LogisticRegression(C=3),
                   svm.SVC(C=0.3,kernel='linear', probability=True)]
        # another two classifiers are commented out by the original author
        dense = [False, False, True, True] # if model needs dense matrix
        X_train_dense = X_train.todense()
        X_test_dense = X_test.todense()
        preds = []
        for ndx, model in enumerate(models):
            t0 = time()
            print "Training: ", model, 20 * '_'
            if dense[ndx]:
                model.fit(X_train_dense, y_train)
                pred = model.predict_proba(X_test_dense)
            else:
                model.fit(X_train, y_train)
                pred = model.predict_proba(X_test)
            print "Training time: %0.3fs" % (time() - t0)
            preds.append(array(pred[:,1]))
        final_pred = preds[0]*0.4 + preds[1]*0.6
        return final_pred
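
For reference, on a binary problem with labels 0 and 1, predict_proba returns one row per sample with two columns: the probability of class 0 and the probability of class 1. So pred[:,1] keeps only the probability of the positive class (here, insult). Below is a small sketch of that selection, using nested lists and made-up values in place of a real scikit-learn model.

```python
# Made-up stand-in for the output of model.predict_proba(X_test) on three
# samples: each row is [P(class 0), P(class 1)] and sums to 1.
pred = [
    [0.10, 0.90],
    [0.80, 0.20],
    [0.45, 0.55],
]

# Plain-Python equivalent of the NumPy slice pred[:, 1]:
# keep the second column, i.e. the probability of class 1 for each sample.
positive_class_proba = [row[1] for row in pred]

print(positive_class_proba)
```

Each entry appended to preds in the function is therefore a one-dimensional array of insult probabilities, one per test sample.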