
I am trying to run different regression models on the prostate cancer data from the lasso2 package. For the lasso, I have seen two different methods to calculate the mean squared error, but they give me quite different results, so I would like to know whether I'm doing something wrong or whether it just means that one method is better than the other.

# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min

# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest) 

# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

So these are the outputs I got for both MSEs:

MSE (method 1): 0.4609978 
MSE (method 2): 0.5654089 

And they're quite different. Does anyone know why? Thanks a lot in advance for your help!

Samuel

  • If I'm reading correctly, `mse.1` is test MSE and `mse.2` is the cross validation error from the selected model—but based solely on training data. – alistaire Sep 14 '16 at 05:06
  • Thanks for pointing this out, alistaire. So the right sequence would be to run cv.glmnet over the training data to get the best lambda, and then use method 1 for the MSE? I assume it doesn't make sense to re-run cv.glmnet a second time over the test data in order to take cvm (the mean cross-validated error)? Sorry, I'm a bit confused. – chogall Sep 14 '16 at 05:35
  • You use cross-validation to give you estimates of test error in order to figure out which lambda value to use on your test data. You should only touch your test data once. – alistaire Sep 14 '16 at 05:47
  • @alistaire Thanks! So I can drop the second method, it actually doesn't make sense. – chogall Sep 14 '16 at 09:05
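
A minimal sketch of the workflow alistaire describes in the comments above (cross-validate on the training data only, then touch the test data exactly once), reusing the xtrain/ytrain/xtest/ytest matrices and the mse function defined in the question; the names cv.fit and test.pred are purely illustrative:

# Cross-validate on the training data only to select lambda
cv.fit = cv.glmnet(xtrain, ytrain, type.measure = 'mse', alpha = 1)

# Use the selected lambda once on the held-out test set
test.pred = predict(cv.fit, s = cv.fit$lambda.min, newx = xtest)

# Report a single test-set MSE (method 1 in the question)
mse(test.pred, ytest)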

1 Answer


As pointed out by @alistaire, in the first case you are using the test data to compute the MSE, while in the second case the reported MSE comes from the cross-validation (training) folds, so it's not an apples-to-apples comparison.

We can do something like the following for an apples-to-apples comparison (by keeping the fitted values on the training folds). As we can see, mse.1 and mse.2 are exactly equal when both are computed on the same training folds (although the value is a little different from yours; on my desktop I'm running R 3.1.2, x86_64-w64-mingw32, Windows 10):

# Needs the following R packages.
library(lasso2)
library(glmnet)

# Gets the prostate cancer dataset
data(Prostate)

# Defines the Mean Square Error function 
mse = function(x,y) { mean((x-y)^2)}

# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))

# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)

# Training set
train = Prostate[train_ind, ]

# Test set
test = Prostate[-train_ind, ]

# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa

# Fitting a linear model by Lasso regression on the "train" data set
# keep the fitted values on the training folds
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)

# get the predicted values on the training folds with lambda.min (not from test data)
mse.1 = mse(pr.lasso$fit[,lambda.id], ytrain) 
cat("MSE (method 1): ", mse.1, "\n")

MSE (method 1):  0.6044496 

# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")

MSE (method 2):  0.6044496 

mse.1 == mse.2
[1] TRUE
  • Thanks for your answer @sandipan. So cv.glmnet uses 10 folds by default. By specifying a much bigger number of folds, does that mean the result of the cross-validation will be better? – chogall Sep 15 '16 at 06:19
  • Having a higher number of folds may lead to overfitting, since you are getting more data to train on. – Sandipan Dey Sep 15 '16 at 06:24
  • OK, so is it best to leave the default value of K = 10? Or is there a way to find the optimal number of folds? – chogall Sep 16 '16 at 03:31
  • have a look at this http://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation – Sandipan Dey Sep 16 '16 at 04:50
  • Thanks a lot for the link! – chogall Sep 16 '16 at 07:48
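
A rough sketch, assuming the xtrain/ytrain matrices from the question are still in scope, that compares the cross-validated error at lambda.min for a few fold counts via the nfolds argument of cv.glmnet:

# Compare the CV estimate of MSE for a few choices of K (number of folds)
for (k in c(5, 10, 20)) {
  set.seed(907)  # same seed for each run so the comparison is reproducible
  cv.k = cv.glmnet(xtrain, ytrain, type.measure = 'mse', alpha = 1, nfolds = k)
  cat("nfolds =", k, "-> CV MSE at lambda.min:",
      cv.k$cvm[cv.k$lambda == cv.k$lambda.min], "\n")
}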