
I can't understand the output of

kfold_results = cross_val_score(xg_cl, X_train, y_train, cv=kfold, scoring='roc_auc')

The output of xgb.cv is clear - there are the train and test scores:

[0] train-auc:0.927637+0.00405497   test-auc:0.788526+0.0152854
[1] train-auc:0.978419+0.0018253    test-auc:0.851634+0.0201297
[2] train-auc:0.985103+0.00191355   test-auc:0.86195+0.0164157
[3] train-auc:0.988391+0.000999448  test-auc:0.870363+0.0161025
[4] train-auc:0.991542+0.000756701  test-auc:0.881663+0.013579

But the result of cross_val_score from the scikit-learn wrapper is ambiguous: it is a list of scores, one per fold, but are those scores computed on the test data or on the train data?

Alex Ivanov

1 Answer


KFold splits the data into the number of folds passed. From the sklearn docs: "Changed in version 0.20: cv default value if None will change from 3-fold to 5-fold in v0.22." So with 5 folds (the default from version 0.22) it splits the dataset into 5 subsets, uses 4 of them for training and 1 for validation, and repeats this for every fold. Therefore the output is an array of 5 items, one validation score per iteration. This is what it looks like:

[image: diagram of a 5-fold split, where each iteration uses 4 folds for training and the remaining fold for validation]
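A minimal sketch of what happens under the hood (the synthetic data and XGBClassifier settings below are illustrative assumptions, not taken from the question): cross_val_score returns one score per fold, and each score is computed only on the held-out (validation) part of that fold.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the question's X_train / y_train
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
xg_cl = XGBClassifier(n_estimators=50)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# One ROC AUC per fold, each computed on that fold's held-out data
kfold_results = cross_val_score(xg_cl, X, y, cv=kfold, scoring='roc_auc')
print(kfold_results)  # array of 5 validation AUCs

# Roughly equivalent manual loop: train on 4 folds, score on the remaining one
manual = []
for train_idx, val_idx in kfold.split(X):
    xg_cl.fit(X[train_idx], y[train_idx])
    proba = xg_cl.predict_proba(X[val_idx])[:, 1]
    manual.append(roc_auc_score(y[val_idx], proba))
print(np.array(manual))

So none of the numbers returned by cross_val_score are training scores; if you also want the training AUC per fold, cross_validate(..., return_train_score=True) reports both.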

Celius Stingher
  • Celius Stingher, thank you very much for the answer and the image. Would you nevertheless please specify whether, in all 5 cases, it outputs the prediction accuracy on the test data, the train data, or test + train data? – Alex Ivanov Sep 25 '19 at 15:15
  • If the answer helps, make sure to vote for it and accept it to give it more visibility for other people facing the same problem! – Celius Stingher Sep 25 '19 at 15:16
  • I thought that every fold splits the data into training and test sets... Do I understand you right that folds 1-4 calculate accuracy only on the train sets, and the 5th on the test set? – Alex Ivanov Sep 25 '19 at 15:51
  • 1
    It validates on the test set (which would be the 5th one in the first iteration). – Celius Stingher Sep 25 '19 at 15:57
  • As far as I understood, if we split the data 3 times, each fold uses 1/3 as the test set and 2/3 as the train set; the model is trained on the train set and validated on the test set, and this is repeated nfold times. Therefore we get 3 outputs: one validation result on the test set for each of the 3 folds. Would you please explain the meaning of the log output of the xgboost cross-validation wrapper? E.g.: [0] train-auc:0.927637+0.00405497 test-auc:0.788526+0.0152854 (there can be hundreds of iterations). I understand that it shows every boosting round, but which fold does that output relate to? Thank you. – Alex Ivanov Sep 26 '19 at 10:10
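
Regarding the xgb.cv log in the last comment: each printed line corresponds to one boosting round, and the numbers are the mean ± standard deviation of the AUC across all nfold folds at that round, so a single line does not belong to any particular fold. A minimal sketch, reusing the synthetic X, y from the earlier example (the parameters here are illustrative assumptions):

import xgboost as xgb

# dtrain corresponds to xgb.DMatrix(X_train, label=y_train) in the question's setup
dtrain = xgb.DMatrix(X, label=y)
cv_results = xgb.cv(
    params={'objective': 'binary:logistic', 'eval_metric': 'auc'},
    dtrain=dtrain,
    num_boost_round=5,
    nfold=5,
    seed=42,
    verbose_eval=True,  # prints one "[i] train-auc:... test-auc:..." line per boosting round
)
# cv_results is a DataFrame with one row per boosting round and columns
# train-auc-mean, train-auc-std, test-auc-mean, test-auc-std
print(cv_results)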