9

I am trying to predict probabilities in a dataset using glmnet. My code reads:

bank <- read.table("http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT", header=TRUE)
bank$rich<-sample(c(0:1), 233, replace=TRUE)
train=bank[1:200,]
test=bank[201:233,]
x=model.matrix(rich~., bank)[,-1]
cv.out=cv.glmnet(x, train$rich, alpha=0, family="binomial")
ridge.mod=glmnet(x, train$rich, alpha=0, family="binomial")
bank$rich <- NULL
newx = data.matrix(test$rich)
ridge.pred=predict(ridge.mod,newx=newx)

train = data[1:2500,]
test = data[2501:5088,]
x=model.matrix(Y~x1+x2+x3+x4+x5+x6, data)[,-1]
cv.out=cv.glmnet(x, data$Y, alpha=0, family="binomial")
bestlam=cv.out$lambda.min
ridge.mod=glmnet(x, data$Y, alpha=0, family="binomial")
test$Y <- NULL
newx = data.matrix(test)
ridge.pred = predict(ridge.mod,newx=newx, type="response")

I keep getting this error message when using predict:

Error in as.matrix(cbind2(1, newx) %*% nbeta) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in t(.Call(Csparse_dense_crossprod, y, t(x))) : error in evaluating the argument 'x' in selecting a method for function 't': Error: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

I've tried this on the "Hitters" dataset and it works perfectly fine.

library(ISLR)
library(glmnet)
Hitters=na.omit(Hitters)

Hitters$Rich<-ifelse(Hitters$Salary>500,1,0)
Hitters.train = Hitters[1:200,]
Hitters.test = Hitters[201:dim(Hitters)[1],]
x=model.matrix(Rich~.,Hitters)[,-1]
cv.out=cv.glmnet(x, Hitters$Rich, alpha=0, family="binomial")
bestlam=cv.out$lambda.min
ridge.mod=glmnet(x, Hitters$Rich, alpha=0, lambda=bestlam, family="binomial")
Hitters.test$Rich <- NULL
newx = data.matrix(Hitters.test)
ridge.pred=predict(ridge.mod,newx=newx, type="response")
head(ridge.pred)
ridge.pred[1:10,]

Does anyone know how I can fix this?

  • I'm voting to close this question as off-topic because it is about how to use R without a reproducible example. – gung - Reinstate Monica Mar 12 '15 at 15:25
  • I have added a reproducible portion above –  Mar 12 '15 at 15:44
  • Thanks! We'll see if we can migrate this for you now. – gung - Reinstate Monica Mar 12 '15 at 15:47
  • Thanks for your help. This has been stumping me for hours. –  Mar 12 '15 at 15:58
  • Just to add a note here, since this is the first Google result for this specific error: in addition to the `NULL` issue caused by using this function with `model.matrix`, this error can also occur when your test x doesn't have the same variables found in the train x. – Vlo May 19 '15 at 20:33

7 Answers

6

I had the same issue, and I think it is caused by the training and testing sets having different factor levels, and therefore sparse matrices with different dimensions.

My solution is to create the sparse model matrix X from the combined dataset:

library(glmnet)   # also attaches Matrix, which provides sparse.model.matrix

# Build one model matrix from the combined data so train and test share the same columns
traintest <- rbind(training, testing)
X <- sparse.model.matrix(as.formula(paste("y ~", paste(colnames(training[, -1]), collapse = " + "))),
                         data = traintest)
# Fit on the training rows only (y is assumed to be the first column of training)
model <- cv.glmnet(X[1:nrow(training), ], training[, 1],
                   family = "binomial", type.measure = "auc", nfolds = 10)
plot(model)
model$lambda.min
# Predict on the test rows
pred <- predict(model, s = "lambda.min", newx = X[-(1:nrow(training)), ], type = "response")

This just makes sure the test set's model matrix has the same columns (and therefore the same dimensions) as the training set's.

Ruge
2

Looks like you just have the wrong thing being assigned to newx. Instead of:

bank$rich <- NULL
newx = data.matrix(test$rich)

you want to null out the values in test$rich and then feed test to data.matrix. Something like this worked for me:

test$rich <- NULL
newx = data.matrix(test)
ridge.pred = predict(ridge.mod, newx=newx)

Also, it looks like your original data frame has some patterns based on row order: rows after 200 have NA values in newAccount. You might want to address the missing values and your train/test split before running the regression.
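
A minimal sketch of that cleanup, assuming the bank data frame from the question (the random 80/20 split and the seed are illustrative, not from the original post):

bank <- na.omit(bank)                     # drop rows with missing values
set.seed(42)                              # illustrative seed
train_idx <- sample(seq_len(nrow(bank)), size = floor(0.8 * nrow(bank)))
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]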

jimu
1

I got the same error because the training and testing datasets had different dimensions due to different factor levels. The problem was that the columns holding factor/categorical data were defined as character columns. I therefore changed those columns from character to factor before splitting into training and testing sets, and it worked:

data$factor_column_a <- as.factor(data$factor_column_a)
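
A slightly more general sketch, assuming a data frame named data as in the line above (converting every character column in one pass):

char_cols <- vapply(data, is.character, logical(1))     # which columns hold character data
data[char_cols] <- lapply(data[char_cols], as.factor)   # convert them all to factors
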
Spyros
0

I had the same issue and was getting the exact same error. In the end none of the above worked for me, but I solved it: as the error clearly states, there is a "wrong dimensions" problem.

About my data

In my case I trained my glmnet fit on data with dimensions 36 x 895, and my test data was 6 x 6. The reason I had only 6 columns in my test dataset was that the lasso selected those 6 features at s="lambda.min".

My solution

I used a sparse matrix from the Matrix package to create the matrix (you could even use a normal matrix):

library(Matrix)

# Empty matrix with the test rows but ALL of the training columns
sparsed_test_data <- Matrix(data = 0,
                            nrow = nrow(test_data),
                            ncol = ncol(training_data),
                            dimnames = list(rownames(test_data),
                                            colnames(training_data)),
                            sparse = TRUE)

and then I substituted the values I had into the correct columns:

for(i in colnames(test_data)){
    sparsed_test_data[, i] <- test_data[, i]
}

Now the predict function works fine.
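
For example, a usage sketch assuming the model was fit with cv.glmnet and stored in a hypothetical object cvfit:

pred <- predict(cvfit, newx = sparsed_test_data, s = "lambda.min", type = "response")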

Mehrad Mahmoudian
0

I've seen this error before as well. The problem in my data set was that the factor variables in my training and test sets had different numbers of levels. Make sure that is not the case.
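
A minimal sketch of one way to check and fix this, assuming hypothetical data frames train and test (not from the original post):

# Re-level every factor column in the test set to the levels seen during training
for (col in names(train)) {
    if (is.factor(train[[col]])) {
        test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
    }
}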

ekardes
0

I'm posting an answer because this question still shows up in searches. The code below runs. I ran into several problems trying to replicate the example: there is missing data in bank, so I deleted those observations. Also, the generated prediction is constant (0.4875) because the ridge regression sets all coefficients other than the intercept to (almost) zero (not surprising, since rich is simulated at random).

library(caret) ## 6.0-81
library(glmnet) ## 2.0-16
url <- "http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT"
bank <- read.table(url, header=TRUE)
set.seed(1)
bank$rich <- sample(c(0:1), nrow(bank), replace=TRUE)
bank <- na.omit(bank)
trainbank <- bank[1:160, ]
testbank <- bank[161:200, ]
x <- model.matrix(rich~., trainbank)[,-1]
y <- trainbank$rich
cv.out <- cv.glmnet(x, y, alpha=0, family="binomial")
x.test <- model.matrix(rich ~ ., testbank)[,-1]
pred <- predict(cv.out, type='response', newx=x.test)
Robert McDonald
-2
ridge.mod_P <- coef(ridge.mod, s = cv.out$lambda.min)   # coefficients at lambda.min
ridge.mod_P
ridge.mod_P@x                           # the stored coefficient values
coe <- matrix(ridge.mod_P@x)
coe2 <- coe[-1, ]                       # drop the intercept
newx16 <- newx[, -17]
newx16
newx16 %*% matrix(coe2)                 # produces NA: this is the reason for the NA output
newx16 <- newx[, -c(1, 17)]
coe2 <- coe[-(1:2), ]                   # the 16 remaining coefficients
newx16 %*% matrix(coe2)                 # yHat: coefficients times variables
heeseon