I am having trouble creating a matrix of explanatory variables for running ridge and lasso regression with cv.glmnet.

My original data frame has dimension 1460 x 81 and consists of several numeric and factor variables. In order to run glmnet, I am attempting to create a matrix of predictors using model.matrix.

However, when I call model.matrix on my dataset, some of the rows are dropped, so my response variable and predictor matrix no longer have the same number of observations.

Here's the structure of the data frame:

str(train1)
'data.frame':   1460 obs. of  80 variables:
$ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage  : num  65 80 68 60 84 85 75 69 51 50 ...
$ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 
$ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley        : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
$ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 
$ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 
$ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...

And now I am passing the data frame to model.matrix to create a matrix.

x = model.matrix(SalePrice ~., data = train1)
dim(x)
[1] 1370  260

Notice how the 1460 x 80 data frame is transformed into a 1370 x 260 matrix: 90 rows have been dropped. This causes a mismatch between the number of rows in my predictor matrix and the length of my response variable when I try to run ridge regression.
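One possible cause of the dropped rows (an assumption on my part, not something I have confirmed for train1) is missing values: model.matrix builds a model frame first, and with R's default na.action of na.omit any row containing an NA is removed. A minimal sketch of how this could be checked, using a small made-up data frame `toy` in place of train1:

```r
# Toy data frame standing in for train1 (hypothetical values).
toy <- data.frame(
  SalePrice   = c(200, 180, 220, 140, 250),
  LotFrontage = c(65, NA, 68, 60, NA),
  Street      = factor(c("Pave", "Pave", "Grvl", "Pave", "Grvl"))
)

# Number of missing values in each column.
na_per_col <- colSums(is.na(toy))

# Number of rows that contain at least one NA.
n_incomplete <- sum(!complete.cases(toy))

# model.matrix() creates a model frame internally; under the default
# na.action (na.omit) the incomplete rows are left out of the result.
x_toy <- model.matrix(SalePrice ~ ., data = toy)

print(na_per_col)
print(n_incomplete)
print(dim(x_toy))
```

If the same check on train1 shows that the number of incomplete rows equals the 90 missing rows, NAs would be the likely explanation.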

cv.ridge <- glmnet(x, y, alpha = 0)

Error in glmnet(x, y, alpha = 0) : 
number of observations in y (1460) not equal to the number of rows of x (1370)
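If the rows are being dropped because of missing values (an assumption, not verified here), two ways of keeping x and y aligned might look like the sketch below. It uses a made-up data frame `toy` rather than train1, and only base R functions:

```r
# Toy data frame standing in for train1 (hypothetical values).
toy <- data.frame(
  SalePrice   = c(200, 180, 220, 140, 250),
  LotFrontage = c(65, NA, 68, 60, NA),
  Street      = factor(c("Pave", "Pave", "Grvl", "Pave", "Grvl"))
)

# Option 1: let model.matrix() drop incomplete rows, then take the
# response only for the rows that survived (matched by row name).
x_toy <- model.matrix(SalePrice ~ ., data = toy)
y_toy <- toy[rownames(x_toy), "SalePrice"]

# Option 2: build the model frame with na.action = na.pass so that no
# rows are dropped. The NAs remain in the matrix and would still have
# to be imputed before fitting.
mf    <- model.frame(SalePrice ~ ., data = toy, na.action = na.pass)
x_all <- model.matrix(SalePrice ~ ., data = mf)

print(dim(x_toy))
print(length(y_toy))
print(dim(x_all))
```

Option 1 discards observations; option 2 keeps all of them but leaves the missing values to be handled separately.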

Any ideas on where to look to ensure that the number of rows in the matrix (x) equals the length of the response (y)?

kms