I have 'train.csv' and 'test.csv' files. The former contains an 'Id' column, a list of feature columns, and a 'Status' column with the labels. The 'test.csv' file contains the same columns except 'Status'.
My task is to train an XGBoost model on 'train.csv', predict the binary 'Status' outcome for each row of 'test.csv', and then save 'Id' and the predicted 'Status' to a separate CSV file for submission.
I am able to train XGBoost on the 'train' file, and the ROC AUC score on a held-out split is pretty good (above 0.8). I have spent hours searching the internet for how to make predictions for the 'test' file and save them to the 'submission' file. To my surprise, although this is quite a common task, I couldn't find any script that reliably performs the operations described above.
My working code for the 'train.csv' file, just in case:
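import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
# Load the training data and select the feature columns
train = pd.read_csv("train.csv")
predictors = ['par48','par52','par75','par82','par84','par85','par86','par87','par89','par108','par109','par132','par156','par165','par167','par175','par190','par197']
X, y = train[predictors], train['Status']
# Hold out 20% of the training rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)
# Evaluate on the held-out rows: accuracy, feature importances, ROC AUC
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
print(xg_cl.feature_importances_)
print(roc_auc_score(y_test, xg_cl.predict_proba(X_test)[:, 1]))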
Do you have working code to share? Thanks!
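For reference, the remaining steps described above (predict on 'test.csv', then save 'Id' and 'Status') might look roughly like the sketch below. It is a minimal illustration, not a definitive solution. The helper names `build_submission` and `train_and_submit` and the output file name 'submission.csv' are made up for this example. It assumes 'Status' is already encoded as 0/1 and that 'test.csv' has the same feature columns as 'train.csv'. The model is refit on all labelled rows rather than on the 80% split, which is a choice, not a requirement.

```python
import pandas as pd


def build_submission(model, test_df, predictors, id_col="Id", target_col="Status"):
    """Return a DataFrame holding each test Id and its predicted class label.

    `model` is any fitted classifier exposing predict(), e.g. xgb.XGBClassifier.
    """
    # Use exactly the same feature columns the model was trained on.
    preds = model.predict(test_df[predictors])
    return pd.DataFrame({id_col: test_df[id_col].to_numpy(), target_col: preds})


def train_and_submit(predictors, train_path="train.csv", test_path="test.csv",
                     out_path="submission.csv"):
    """Fit XGBoost on the full training file and write predictions for the test file."""
    import xgboost as xgb  # imported here so build_submission works without it

    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    # Same settings as in the question; random_state fixes the seed.
    model = xgb.XGBClassifier(objective="binary:logistic", n_estimators=10,
                              random_state=123)
    model.fit(train[predictors], train["Status"])

    submission = build_submission(model, test, predictors)
    # index=False keeps the row index out of the file, leaving only Id and Status.
    submission.to_csv(out_path, index=False)
    return submission
```

Calling `train_and_submit(predictors)` with the `predictors` list from the question would write 'submission.csv' next to the input files. If the submission expects probabilities rather than 0/1 labels, `model.predict_proba(test[predictors])[:, 1]` could be used in place of `predict`, as in the ROC AUC line above.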