I am a beginner in machine learning and I am trying to understand the process in more detail.
For any machine learning scenario:
(1) First, I split my data 90% / 10% and set the 10% aside for testing at the very last step.
Code:
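from sklearn.model_selection import train_test_split
# Hold out 10% of the data; it is only used in step 5
X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)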
(2) Second, if my data permits (it is not too big), I run K-fold cross-validation on the remaining 90%.
From the fold scores, I take the mean accuracy and its standard deviation as a rough indication of the bias and variance of the model I selected.
From here, I can tune the model: tune hyperparameters, do feature selection, and try different algorithms (random forest, etc.) to see what gives the best results.
Code:
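from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression()
# 10-fold cross-validation on the 90% training portion
scores = cross_val_score(logreg, X1, y1, cv=10, scoring="accuracy")
print(scores.mean())  # mean accuracy across the 10 folds
print(scores.std())   # spread of the accuracy across the folds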
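To illustrate the hyperparameter tuning I mention in step 2, here is a minimal sketch of what I have in mind. It runs on synthetic data from make_classification rather than my real data, and the grid of C values and max_iter=1000 are assumptions chosen only for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the real X, y)
X, y = make_classification(n_samples=500, n_features=10, random_state=101)

# Same 90/10 split as in step 1; the 10% is not touched during tuning
X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)

# Hypothetical grid over the regularisation strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate C is scored with 10-fold cross-validation on X1, y1 only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X1, y1)

print(search.best_params_)  # the C value with the best mean CV accuracy
print(search.best_score_)   # that mean CV accuracy
```

The held-out 10% (X_Val, y_Val) is created here but deliberately never passed to the search.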
(3) Next, I use cross_val_predict to get the cross-validated predictions (ypred):
Code:
from sklearn.model_selection import cross_val_predict
ypred = cross_val_predict(logreg, X1, y1, cv=10)
(4) From there, I can run a classification report:
Code:
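from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
print(classification_report(y1, ypred))
print(accuracy_score(y1, ypred))
print(confusion_matrix(y1, ypred))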
(5) If I am satisfied with the results from the classification report, I can feed in unseen data (X_Val, y_Val), which in my case is the test set I set aside in step 1.
This is done as follows:
Code:
logreg2 = LogisticRegression()
logreg2.fit(X1,y1)
y_pred2 = logreg2.predict(X_Val)
Then I can run another classification_report with (y_Val, y_pred2).
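To make the comparison in my question concrete, here is a minimal end-to-end sketch that produces both numbers side by side: the cross-validated accuracy on the 90% and the accuracy on the held-out 10%. It uses synthetic data from make_classification instead of my real data, and max_iter=1000 is an assumption added only so the solver converges:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the real X, y)
X, y = make_classification(n_samples=500, n_features=10, random_state=101)

# Step 1: hold out 10% for the final check
X1, X_Val, y1, y_Val = train_test_split(X, y, test_size=0.1, random_state=101)

# Step 2: 10-fold cross-validation on the 90% only
logreg = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(logreg, X1, y1, cv=10, scoring="accuracy")

# Step 5: refit on all of the 90%, then predict the held-out 10% once
logreg2 = LogisticRegression(max_iter=1000)
logreg2.fit(X1, y1)
y_pred2 = logreg2.predict(X_Val)
test_accuracy = accuracy_score(y_Val, y_pred2)

print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
print("Held-out accuracy: %.3f" % test_accuracy)
print(classification_report(y_Val, y_pred2))
```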
I have 2 questions from the above:
(1) Are the steps correct? Please feel free to let me know if I have missed anything.
(2) What should I report as the actual accuracy of my model: the classification report from step 4 (cross-validation) or from step 5 (held-out test set)?
Thank you very much for your help