I have a fitted cv.glmnet model that I want to use to predict on new data, but I run into a problem when creating the model matrix for that new data. I need to block bootstrap the test data and predict the response for every bootstrap sample. In some samples, one or more of the categorical variables has only one level, and building the model matrix then fails with an error. Here is an example.

library(splines)
library(caret)
library(glmnet)

data(iris)
Inx <- sample(nrow(iris),100)
iris$Species <- factor(iris$Species)

train_data <- iris[Inx, ]
test_data <- iris[-Inx,]

Formula <- "Sepal.Length ~ Sepal.Width + Petal.Length + Species:Petal.Width + Sepal.Width:Petal.Length +  Species +  bs(Petal.Width, df = 2, degree = 2)"
# Encode the training predictors as a numeric design matrix
ModelMatrix <- predict(caret::dummyVars(Formula, train_data, fullRank = TRUE, sep = ""), train_data)
y <- train_data[, "Sepal.Length"]

cvglm <- cv.glmnet(x = ModelMatrix, y = y, nfolds = 4,
                   keep = TRUE, alpha = 1, parallel = FALSE, type.measure = "mse")
# Mimic a bootstrap block in which Species takes a single value
test_data$Species <- "virginica"
ModelMatrix_test <- predict(caret::dummyVars(Formula, test_data, fullRank = TRUE, sep = ""), test_data)

Then I get this error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Any suggestions to solve the problem would be appreciated.

UseR10085
Nile
  • Why are you using `caret::dummyVars` here? Does `ModelMatrix <- model.matrix(as.formula(Formula), data = train_data)` give you what you're looking for? – RyanFrost Apr 30 '20 at 00:31
  • model.matrix throws the same error. I just found a page that compares model.matrix and caret::dummyVars: https://rsangole.netlify.app/post/dummy-variables-one-hot-encoding/ – Nile Apr 30 '20 at 01:52
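
The error mentioned in these comments does not depend on caret or glmnet. As a minimal base R sketch (the data frames `one_level` and `two_levels` are made up for illustration), the same message appears whenever a factor with a single level is expanded into contrasts, and disappears once a second level is present:

```r
# Hypothetical toy data: a factor that only ever takes the value "a"
one_level <- data.frame(y = c(1, 2, 3, 4), g = factor(c("a", "a", "a", "a")))

# model.matrix() cannot build contrasts for a one-level factor;
# capture the error message instead of stopping
msg <- tryCatch(
  {
    model.matrix(y ~ g, data = one_level)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# The same call works when the factor has two levels
two_levels <- data.frame(y = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "b")))
mm <- model.matrix(y ~ g, data = two_levels)
print(colnames(mm))
```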

1 Answer

The error is straightforward: the categorical predictor Species in test_data contains only one level (virginica), so contrasts cannot be applied. Contrasts can only be computed when a categorical variable in the model (in your case, Species) has 2 or more levels (e.g. versicolor and virginica, or setosa, versicolor and virginica). To achieve that, you can modify your sampling command like this:

library(splines)
library(caret)
#> Warning: package 'caret' was built under R version 3.6.2
#> Loading required package: lattice
#> Loading required package: ggplot2
library(glmnet)
#> Warning: package 'glmnet' was built under R version 3.6.2
#> Loading required package: Matrix
#> Loaded glmnet 3.0-2

data(iris)

set.seed(123)
Inx <- sample(nrow(iris), 0.7 * nrow(iris))
iris$Species <- factor(iris$Species)

train_data <- iris[Inx, ]
test_data <- iris[-Inx,]

Formula <- "Sepal.Length ~ Sepal.Width + Petal.Length + Species:Petal.Width + Sepal.Width:Petal.Length +  Species +  bs(Petal.Width, df = 2, degree = 2)"
ModelMatrix <- predict(caret::dummyVars(Formula, train_data, fullRank = TRUE, sep = ""), train_data)
y <- train_data[, "Sepal.Length"]

cvglm <- cv.glmnet(x = ModelMatrix, y = y, nfolds = 4,
                   keep = TRUE, alpha = 1, parallel = FALSE, type.measure = "mse")

ModelMatrix_test <- predict(caret::dummyVars(Formula, test_data, fullRank = TRUE, sep = ""), test_data)

It is better practice to split the data into training and test sets in such a way that both are representative of the original dataset, which can be achieved by random sampling.

Created on 2020-04-30 by the reprex package (v0.3.0)

UseR10085
  • That was just a reproducible example to explain the error. In the original problem I am working on, it is not up to me how to split the data into test and training sets, because I need to forecast a time series. The categorical variable can be, for example, whether a day is a weekend or a public holiday, and when I bootstrap blocks of, say, 4 days, a block might or might not include a public holiday or weekend. – Nile Apr 30 '20 at 23:39
  • Whatever the case, if you want to avoid the error, you have to sample the data in such a way that your categorical variable contains 2 or more levels. – UseR10085 May 01 '20 at 04:30
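
For the bootstrap-block case raised in the comments, where resampling cannot guarantee two levels per block, one option not covered in this thread is to build each block's matrix from the terms and factor levels recorded on the training data, instead of re-deriving them from the block. The sketch below is an assumption-laden illustration, not the original poster's method: it swaps caret::dummyVars for base R's model.frame() and model.matrix(), and the names `form`, `mf_train`, `tt`, `xlev` and `block` are illustrative.

```r
library(splines)

data(iris)
set.seed(123)
idx <- sample(nrow(iris), 100)
train_data <- iris[idx, ]
test_data  <- iris[-idx, ]

form <- Sepal.Length ~ Sepal.Width + Petal.Length + Species:Petal.Width +
  Sepal.Width:Petal.Length + Species + bs(Petal.Width, df = 2, degree = 2)

# Build the training model frame once. Its terms object remembers the
# spline boundary knots, so bs() is evaluated consistently on new data.
mf_train <- model.frame(form, data = train_data)
tt_full  <- terms(mf_train)
# Drop the intercept column, since glmnet fits its own intercept
x_train  <- model.matrix(tt_full, mf_train)[, -1, drop = FALSE]

# Record the factor levels seen in training and drop the response
xlev <- .getXlevels(tt_full, mf_train)
tt   <- delete.response(tt_full)

# A block in which Species takes a single value, as in the question
block <- test_data
block$Species <- "virginica"

# xlev restores the full training level set, so contrasts can be built
mf_block <- model.frame(tt, data = block, xlev = xlev)
x_block  <- model.matrix(tt, mf_block)[, -1, drop = FALSE]

print(identical(colnames(x_train), colnames(x_block)))
```

Because x_block has the same columns as x_train, it could be passed as newx when predicting from a cv.glmnet object fitted on x_train, for example predict(cvglm, newx = x_block, s = "lambda.min"). The column names differ from those produced by dummyVars, so the model would need to be refitted on x_train as built here.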