I have two files, 'train.csv' and 'test.csv'. The former contains an 'Id' column, a set of feature columns, and a 'Status' column holding the target values. 'test.csv' has the same columns except 'Status'.

My task is to train an XGBoost model on 'train.csv', predict the binary 'Status' outcome for 'test.csv', and then save 'Id' and 'Status' to a separate CSV file for submission.

I am able to train XGBoost on the 'train' file, and the ROC AUC score on a held-out split is pretty good (above 0.8). I have spent hours searching the internet for how to make predictions for the 'test' file and save them to the 'submission' file. To my surprise, although this is quite a common task, I couldn't find any script that reliably performs the operations described above.

My working code for the 'train.csv' file, just in case:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv("train.csv")
predictors = ['par48', 'par52', 'par75', 'par82', 'par84', 'par85', 'par86', 'par87', 'par89', 'par108', 'par109', 'par132', 'par156', 'par165', 'par167', 'par175', 'par190', 'par197']
X, y = train_data[predictors], train_data['Status']
# Hold out 20% of the training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
print(xg_cl.feature_importances_)
# ROC AUC uses the predicted probability of the positive class
print(roc_auc_score(y_test, xg_cl.predict_proba(X_test)[:, 1]))

Do you have working code to share? Thanks!

Vladimir

1 Answer

model.predict returns the predictions as an array. So first read the separate test file, then use the model you built from the training data to predict the output for it. Finally, add that array of predictions to the pandas DataFrame you read as a new column and write the result to a CSV file:

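# Separate test (evaluation) dataset that doesn't include the output
test_data = pd.read_csv('test.csv')
# Choose the same columns you trained the model with
X = test_data[predictors]
# Store the predictions under the column name the submission expects
test_data['Status'] = xg_cl.predict(X)
# Write only 'Id' and 'Status', without the DataFrame index
test_data[['Id', 'Status']].to_csv('submission.csv', index=False)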
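
If you want the same steps as a reusable helper, here is a minimal sketch. The function name write_submission and its id_col / target_col parameters are my own invention for illustration, not part of pandas or XGBoost. It assumes the model is already fitted and exposes a scikit-learn style predict method, and that the test data has an 'Id' column as described in the question.

```python
import pandas as pd


def write_submission(model, test_df, predictors, path,
                     id_col="Id", target_col="Status"):
    """Predict with an already fitted model and write id + prediction to CSV.

    Hypothetical helper: `model` only needs a scikit-learn style
    `predict(X)` method, which XGBClassifier provides.
    """
    # Fail early if the test data lacks any column the model was trained on
    missing = [c for c in predictors if c not in test_df.columns]
    if missing:
        raise KeyError(f"test data is missing predictor columns: {missing}")

    # Use exactly the training columns, in the same order
    preds = model.predict(test_df[predictors])

    # Keep only the identifier and the predicted label
    submission = pd.DataFrame({
        id_col: test_df[id_col].to_numpy(),
        target_col: preds,
    })

    # index=False keeps the row index out of the submission file
    submission.to_csv(path, index=False)
    return submission
```

With the variables from the question, the call would look like write_submission(xg_cl, pd.read_csv('test.csv'), predictors, 'submission.csv'). If the submission is scored on ROC AUC rather than accuracy, it may be worth writing xg_cl.predict_proba(X)[:, 1] instead of the hard 0/1 labels, but that depends on what the submission format asks for.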
O.Suleiman