
I am a beginner in machine learning and I have been trying to understand the process in more detail.

For any machine learning scenario:

(1) The first step I do is split my data in a 90%/10% ratio and keep the 10% aside for testing at the very last step

Code:

X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)

(2) The second step, if my data permits (i.e. is not too big), is to run K-fold cross-validation on the training data.

From those scores, I can get the bias, variance and accuracy of the model that I selected.

From here, I can tune the model, i.e. tune hyperparameters, do feature selection, and try different algorithms (random forest, etc.) to see what gives the best results

Code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression()

scores = cross_val_score(logreg, X1, y1, cv=10, scoring="accuracy")

scores.mean()

scores.std()
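The hyperparameter tuning mentioned in this step is commonly done with GridSearchCV, which runs the same K-fold CV over a grid of candidate settings. A minimal sketch (the parameter grid and the synthetic `make_classification` data are illustrative, not part of the original question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the question's X1, y1
X1, y1 = make_classification(n_samples=200, random_state=101)

# Example grid: regularization strengths to try (values are illustrative)
param_grid = {"C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, cv=10, scoring="accuracy")
grid.fit(X1, y1)

print(grid.best_params_)  # hyperparameter setting with the best mean CV accuracy
print(grid.best_score_)   # mean CV accuracy achieved by that setting
```

`grid.best_estimator_` then gives a model already configured with the winning hyperparameters, ready to refit before the final test-set evaluation.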

(3) Now I use cross_val_predict to get the y predictions (y_pred)

Code:

from sklearn.model_selection import cross_val_predict
ypred = cross_val_predict(logreg, X1, y1, cv=10)

(4) From there, I can run a classification report:

Code:

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

print(classification_report(y1, ypred))
accuracy_score(y1, ypred)
confusion_matrix(y1, ypred)

(5) Now, if we are satisfied with the results of the classification report, we can feed in the unseen data (X_Val, y_Val), in our case the test set we set aside in step 1

This is done as such:

Code:

logreg2 = LogisticRegression()
logreg2.fit(X1, y1)
y_pred2 = logreg2.predict(X_Val)

Then we can run another classification_report with (y_Val,y_pred2)
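Put together, that final evaluation mirrors step (4) but on the held-out split. A minimal end-to-end sketch, using synthetic `make_classification` data as a stand-in for the question's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the question's X, y
X, y = make_classification(n_samples=200, random_state=101)
X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)

# Fit on the full training split, predict on the held-out split
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X1, y1)
y_pred2 = logreg2.predict(X_Val)

# Final report on data the model has never seen
print(classification_report(y_Val, y_pred2))
print(accuracy_score(y_Val, y_pred2))
print(confusion_matrix(y_Val, y_pred2))
```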

I have 2 questions from the above:

(1) Are the steps correct? Please feel free to let me know if I have missed anything.

(2) What should I report as the actual accuracy of my model: the classification report from step 5 or from step 4?

Thank you very much for your help

Ali Parahoo

1 Answer


Your procedure is correct in general. The discussion in "Order between using validation, training and test sets" will be useful. Minor issues/clarifications:

  • In step #1, we usually use the term "test set" and not "validation set" (the validation part is covered by K-fold CV here), so x_test and y_test would be more appropriate variable names.

  • In step #5, it is expected that you will use the specific hyperparameters selected during cross-validation (your example does not explicitly show this).
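For illustration, re-using the CV-selected hyperparameters in the final fit might look like this (the `C` value and the synthetic `make_classification` data are placeholders, not values from the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the question's X, y
X, y = make_classification(n_samples=200, random_state=101)
X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)

# Re-fit on the full training split with the hyperparameters chosen
# during cross-validation (C=1.0 is just an example value here)
logreg2 = LogisticRegression(C=1.0, max_iter=1000)
logreg2.fit(X1, y1)
y_pred2 = logreg2.predict(X_Val)
```

The point is that `logreg2` is constructed with the exact settings that won the CV comparison, rather than with defaults.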

Since you have used a test set for the final assessment of your model, the correct thing here would indeed be to report the results from step #5; nevertheless, you can always report the results from step #4 as well, as long as you provide the proper clarification, i.e. "CV accuracy x, test accuracy y".

desertnaut