I performed the following on a data set that contains 151 variables with 161 observations:
> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)
And got the following results:
fold 1
Observations in test set: 16
7 11 12 24 33 38 52 67 72
Predicted 49.6 44.1 26.4 39.8 53.3 40.33 47.8 56.7 58.5
cvpred 575.0 -113.2 640.7 -1045.8 876.7 -5.93 2183.0 -129.7 212.6
RT..seconds. 42.0 44.0 44.0 45.0 45.0 46.00 49.0 56.0 58.0
CV residual -533.0 157.2 -596.7 1090.8 -831.7 51.93 -2134.0 185.7 -154.6
What I want to do is compare the predicted results to the actual experimental results, so I can plot a graph of the two against each other to show how similar they are. Am I right in assuming I should use the values in the Predicted row as my predicted results, and not the cvpred row?
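For reference, this is a rough sketch of the comparison plot I have in mind. I'm assuming here that cv.lm returns the data frame with the Predicted and cvpred columns appended (if it doesn't, fitted(fit) gives the same in-sample values):

cv.out <- cv.lm(df = cadets, fit, m = 10)

# Observed reaction times against the in-sample fitted values;
# points close to the 0/1 line would mean good agreement
plot(cv.out$RT..seconds., cv.out$Predicted,
     xlab = "Observed RT (seconds)",
     ylab = "Predicted RT (seconds)")
abline(0, 1)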
I only ask because when I performed the very same thing with the caret package, the predicted and observed values came out far more different from one another:
library(caret)
ctrl <- trainControl(method = "cv", savePred = T, classProb = T)
mod <- train(RT..seconds. ~ ., data = cadets, method = "lm", trControl = ctrl)
mod$pred
pred obs rowIndex .parameter Resample
1 141.2 42 6 none Fold01
2 -504.0 42 7 none Fold01
3 1196.1 44 16 none Fold01
4 45.0 45 27 none Fold01
5 262.2 45 35 none Fold01
6 570.9 52 58 none Fold01
7 -166.3 53 61 none Fold01
8 -1579.1 59 77 none Fold01
9 2699.0 60 79 none Fold01
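This is how I was planning to put caret's held-out predictions against the observations, using the pred and obs columns of mod$pred shown above:

# Same comparison, but with caret's saved cross-validation predictions
plot(mod$pred$obs, mod$pred$pred,
     xlab = "Observed RT (seconds)",
     ylab = "Cross-validated prediction")
abline(0, 1)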
The model shouldn't be this inaccurate: I originally started with 1664 variables and reduced them using a random forest, keeping only the variables with a variable importance greater than 1, which cut the dataset from 162 * 1664 down to 162 * 151.
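For completeness, the reduction step looked roughly like this (cadets.full is just a placeholder name here for the original 162 * 1664 data frame; the importance > 1 cut-off is the one I used):

library(randomForest)

# cadets.full stands for the original 162 x 1664 data set (placeholder name)
rf <- randomForest(RT..seconds. ~ ., data = cadets.full, importance = TRUE)

imp  <- importance(rf, type = 1)        # %IncMSE for each predictor
keep <- rownames(imp)[imp[, 1] > 1]     # variables with importance > 1

cadets <- cadets.full[, c("RT..seconds.", keep)]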
If someone could explain this to me I'd be grateful, thanks.