
So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, and yvalid data for a homework assignment. The assignment is for a Random Forest, but I think my question applies to most models.

My understanding is that you use Xtrain and ytrain to fit the model, e.g. clf.fit(Xtrain, ytrain), and this creates the model, which can then give you a score and predictions for your training data.

So when I move on to the test and validation data sets, I thought I would only use ytest and yvalid to see how the model predicts and scores. My professor provided us with three X datasets (Xtrain, Xtest, Xvalid), but to me it seems I only need Xtrain to train the model initially and can then test the model on the different y data sets.

If I called .fit() for each X, y pair, I would create/fit three different models from completely different data, so the models would not be comparable from my perspective.

Am I wrong?

JoeStat1986

1 Answer


Training step:

Assuming you are using sklearn, the clf.fit(Xtrain, ytrain) method trains your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage, you can compute a score to evaluate your model on the training data, as you said.

# train step: fit a Random Forest (as in the assignment) on the training data
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(Xtrain, ytrain)
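
For the training-data score mentioned above, a minimal sketch (for sklearn classifiers, clf.score returns the mean accuracy):

# score on the data the model was fit on (this will usually be optimistic)
train_accuracy = clf.score(Xtrain, ytrain)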

Test step:

Then, you feed the test data Xtest to the previously trained model in order to generate predicted labels ypred.

# test step: predict labels for data the model has not seen during training
ypred = clf.predict(Xtest)

Finally, you compare these predicted labels ypred with the true labels ytest to get a robust evaluation of the model's performance on unseen data (data not used during training), using tools such as a confusion matrix and classification metrics:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

test_cm = confusion_matrix(ytest, ypred)
test_report = classification_report(ytest, ypred)
test_accuracy = accuracy_score(ytest, ypred)
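
The question also mentions a validation split (Xvalid, yvalid), which the steps above did not cover. As a minimal sketch, assuming the assignment simply wants the same trained model evaluated on that split as well (rather than, say, using it for hyperparameter tuning), the pattern is identical: reuse the already-fitted clf, predict on Xvalid, and compare against yvalid.

# validation step (sketch): reuse the same trained model, do not refit
ypred_valid = clf.predict(Xvalid)
valid_accuracy = accuracy_score(yvalid, ypred_valid)
valid_cm = confusion_matrix(yvalid, ypred_valid)

Either way, the key point for the question is that the model is fitted once on Xtrain/ytrain, and Xtest and Xvalid are only ever passed to predict, never to fit, so all the scores describe the same model.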