
I performed the following on a data set that contains 151 variables with 161 observations:

> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)

And got the following results:

fold 1 
Observations in test set: 16 
                  7     11     12      24     33    38      52     67     72
Predicted      49.6   44.1   26.4    39.8   53.3 40.33    47.8   56.7   58.5
cvpred        575.0 -113.2  640.7 -1045.8  876.7 -5.93  2183.0 -129.7  212.6
RT..seconds.   42.0   44.0   44.0    45.0   45.0 46.00    49.0   56.0   58.0
CV residual  -533.0  157.2 -596.7  1090.8 -831.7 51.93 -2134.0  185.7 -154.6

What I want to do is compare the predicted results to the actual experimental results, so I can plot a graph of the two against each other to show how similar they are. Am I right in assuming I would do this by using the values in the Predicted row as my predicted results, and not the cvpred row?

I only ask because when I performed the very same thing with the caret package, the predicted and observed values came out far more different from one another:

library(caret)
ctrl <- trainControl(method = "cv", savePred=T, classProb=T)
mod <- train(RT..seconds.~., data=cadets, method = "lm", trControl = ctrl)
mod$pred

        pred obs rowIndex .parameter Resample
1      141.2  42        6       none   Fold01
2     -504.0  42        7       none   Fold01
3     1196.1  44       16       none   Fold01
4       45.0  45       27       none   Fold01
5      262.2  45       35       none   Fold01
6      570.9  52       58       none   Fold01
7     -166.3  53       61       none   Fold01
8    -1579.1  59       77       none   Fold01
9     2699.0  60       79       none   Fold01
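
For reference, this is roughly how I'm comparing the two (a quick sketch rather than my exact code; it assumes mod$pred keeps the pred/obs columns shown above):

# held-out (cross-validated) error from caret's saved predictions
sqrt(mean((mod$pred$pred - mod$pred$obs)^2))

# quick visual check: held-out predictions vs. observations
plot(mod$pred$obs, mod$pred$pred,
     xlab = "Observed RT (seconds)", ylab = "CV predicted RT (seconds)")
abline(0, 1, lty = 2)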

The model shouldn't be this inaccurate. I originally started with 1664 variables and reduced them using a random forest, keeping only the variables with a variable importance greater than 1, which massively reduced my dataset from 162 * 1664 to 162 * 151.
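
Roughly, the variable-importance filter looked like this (a sketch rather than my exact code; full_data is just a placeholder name for the original 162 * 1664 data frame):

library(randomForest)

# fit a random forest on the full data and keep only the important predictors
# (full_data is a placeholder for the original 1664-variable data frame)
rf <- randomForest(RT..seconds. ~ ., data = full_data, importance = TRUE)
imp <- importance(rf, type = 1)        # %IncMSE for each predictor
keep <- rownames(imp)[imp[, 1] > 1]    # variables with importance > 1

cadets <- full_data[, c("RT..seconds.", keep)]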

If someone could explain this to me I'd be grateful, thanks

user2062207

1 Answer


I think there are a few areas of confusion here; let me try to clear them up for you.

The "Predicted" section from cv.lm does not correspond to results from crossvalidaiton. If you're interested with crossvalidaiton then you need to look at your "cvpred" results -- "Predicted" corresponds to predictions from the model fit using all of your data.

The reason there is such a large difference between your predictions and your cvpred values is likely that your final model is overfitting, which should illustrate why cross-validation is so important.

I believe that you are calling cv.lm incorrectly. I've never used the package, but I think you want to pass in the formula, something like cv.lm(df = cadets, RT..seconds.~., m = 10), rather than your fit object. I'm not sure why you see such a large difference between your cvpred and Predicted values in the example above, but these results tell me that passing in a fitted model leads to the model fit on all of the data being used for each CV fold:

library(DAAG)
fit <- lm(Sepal.Length ~ ., data=iris)
mod1 <- cv.lm(df=iris,fit,m=10)
mod2 <- cv.lm(df=iris,Sepal.Length ~ .,m=10)
> sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
[1] 0.318
> sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
[1] 5.94
> sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
[1] 0.0311
> sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
[1] 5.94

The reason there is such a difference with your caret results is that you were looking at the "Predicted" values. "cvpred" should line up closely with caret (although make sure to match up the indices on your CV results), and if you want to line up the "Predicted" results with caret you will need to get your predictions using something like predict(mod, cadets).
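
As a rough sketch of lining the two packages up (this assumes mod and cadets from your question plus the out object from the cv.lm sketch above; exact values will still differ because the fold assignments differ):

# caret's held-out predictions, put back into original row order
caret_cv <- mod$pred[order(mod$pred$rowIndex), ]

# should be in the same ballpark as cv.lm's cvpred column
head(cbind(daag_cvpred = out$cvpred, caret_cvpred = caret_cv$pred))

# analogue of cv.lm's "Predicted" column: the model fit on all of the data
head(cbind(daag_predicted = out$Predicted, caret_full = predict(mod, cadets)))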

David
    Thank you for that, that really helps. If I could, I would honestly pay you for your help. I'm just going to stick with caret in that case. When doing predict(mod, cadets), does this then simply look at the Predicted results from the model and not incorporate the cross validation of it? In other words, would it be more accurate for me to plot the predict(mod, cadets) to show how accurate the model is, or would it be better for me to plot the pred vs obs results from mod$pred? – user2062207 Dec 09 '13 at 16:50
  • Also how do I use the cross validation to then improve on the fit? – user2062207 Dec 09 '13 at 16:57
  • Think of it this way: use cross-validation results to evaluate your model and (once it has been evaluated and is deemed good to go) use `predict()` to apply your model in practice to new data. – David Dec 09 '13 at 16:58
  • The cross validation results that I'm getting are completely off from the actual values, and my RMSE is like 8000. How would I go about improving the model so it doesn't overfit with the data that I have? Would increasing the number of times I repeat the CV help? e.g. `ctrl <- trainControl(method = "repeatedcv", repeats = 10, savePred=T, classProb=T, number = 10)` – user2062207 Dec 09 '13 at 17:20
    It sounds like your model isn't very good, which is more a conceptual problem than a programmatic one, and is thus better suited for Cross Validated than Stack Overflow. But the short answer is that cross-validation does not make your model better; it only shows you how good your model is, so you'll need to change how you build your model (feature selection, feature engineering, model tuning, etc.). – David Dec 09 '13 at 17:25