I only have a small dataset of 30 samples, so I have a training set but no separate test set, and I want to use cross-validation to assess the model. I have fit PLS models in R using cross-validation (LOO). The `mvr` output contains both the fitted values and the `validation$pred` values, and they are different. When reporting the final R2 and RMSE for the training set, should I be using the fitted values or the `validation$pred` values?
-
`fitted.values` represent the results of model development (calibration), while `validation$pred` represents the predictions from cross-validation. – UseR10085 Aug 10 '20 at 07:07
-
Thanks Bappa Das. So do I report the final model performance based on the fitted or CV predictions? – Cathy Aug 10 '20 at 07:14
-
Both should be reported, and it is always advisable to test your model using an independent test or validation dataset. – UseR10085 Aug 10 '20 at 07:15
-
Thanks. But they give different results; which one should I use to actually assess how valid the model is? Unfortunately I don't have a test set. – Cathy Aug 10 '20 at 08:16
1 Answer
The short answer: if you want to know how good the model is at predicting, use `validation$pred`, because those predictions are made on data the model did not see. The values under `$fitted.values` are obtained by fitting the final model on all of your training data, so the same data are used both to build the model and to evaluate it. The performance you get from this final fit will therefore be overly optimistic compared with how the model will do on unseen data.
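For your 30-sample LOO case, a minimal sketch of what that looks like in practice (assuming your fitted object is called `fit`, your response vector is `y`, and `nc` is the number of components you settled on):
cv_pred  <- fit$validation$pred[, , nc]   # out-of-sample (LOO) predictions
fit_pred <- fit$fitted.values[, , nc]     # in-sample fitted values
rmse <- function(p, a) sqrt(mean((p - a)^2))
r2   <- function(p, a) 1 - sum((a - p)^2) / sum((a - mean(a))^2)
rmse(cv_pred, y); r2(cv_pred, y)      # report these as the cross-validated performance
rmse(fit_pred, y); r2(fit_pred, y)    # these will look better, but are optimistic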
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to choose the best hyperparameter, in this case the number of components in the model.
During cross-validation, part of the data is held out of the fitting and serves as a test set, which gives a rough estimate of how the model will perform on unseen data. See the illustration of cross-validation in the scikit-learn documentation for how this works.
LOO works in a similar way. After finding the best parameter, you would then refit a final model and use it on the test set. In the example below, `mvr` evaluates models with up to 4 components during cross-validation, while `$fitted.values` comes from the model trained on all of the training data.
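As a sketch of that workflow with the pls package (RMSEP and selectNcomp are real pls functions; the object names fit, d and the choice of ncomp = 10 below are placeholders):
fit <- plsr(y ~ ., ncomp = 10, data = d, validation = "LOO")   # leave-one-out CV
plot(RMSEP(fit), legendpos = "topright")      # CV error versus number of components
nc <- selectNcomp(fit, method = "onesigma")   # parsimonious choice of components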
You can also see below how different they are. First, fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
# hold out 106 rows as a test set, train on the remaining 400
idx = sample(nrow(BostonHousing), 400)
trainData = BostonHousing[idx, ]
testData = BostonHousing[-idx, ]
# PLS regression with 4 components, assessed by cross-validation
mdl <- mvr(medv ~ ., ncomp = 4, data = trainData, validation = "CV",
           method = "oscorespls")
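If you just want a quick look at the cross-validated error for each number of components, the standard summary method prints it:
summary(mdl)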
Then we calculate the mean squared error (MSE) in CV, on the full training fit, and on the test data, using 4 components:
# mean squared prediction error
calc_MSE = function(pred, actual) { mean((pred - actual)^2) }
# error in CV
calc_MSE(mdl$validation$pred[,,4], trainData$medv)
[1] 43.98548
# error on full training model , not very useful
calc_MSE(mdl$fitted.values[,,4], trainData$medv)
[1] 40.99985
# error on test data
calc_MSE(predict(mdl, testData, ncomp = 4), testData$medv)
[1] 42.14615
You can see that the cross-validation error is closer to what you would get if you had test data. How close they are will, of course, depend on your data.
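For completeness, the pls package also has built-in accessors that report essentially the same comparison as root mean squared errors (i.e. the square roots of the MSE values above); a sketch using RMSEP:
RMSEP(mdl, estimate = "CV", ncomp = 4)                          # cross-validated error
RMSEP(mdl, estimate = "train", ncomp = 4)                       # apparent (training) error
RMSEP(mdl, estimate = "test", newdata = testData, ncomp = 4)    # test-set error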
