0

I am attempting to do classification prediction using glmnet, however I cannot deduce what the return object of "glmnet.predict" is supposed to represent. Using the code

mlogit_r<-glmnet(train_x, cbind(cns_label, renal_label,breast_label,nsclc_label,ovarian_label,leuk_label,colon_label, mela_label),
            family="multinomial", alpha=0)
pred <- predict(mlogit_r, train_x, type="class")

with train_x being 57(n) x 6830(p), and the y object being 57(n) x 8 (num classes). The returned prediction object is a 57 x 100 matrix with labels. Which of these are the predicted labels?

It does not show in the documentation, as it just says

The object returned depends the . . . argument which is passed on to the predict method for glmnet objects.

user3707850
  • 73
  • 3
  • 8
  • 1
    Out of curiosity what are you using glmnet multinomial prediction for? – OLIVER.KOO Feb 10 '18 at 20:19
  • 1
    @OLIVER.KOO I am attempting different methods from " The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition" on their sample dataset "NCI(microarray)" which deals with identifying genes for different cancers. – user3707850 Feb 11 '18 at 22:26
  • wow @user3707850 that sounds super cool. Can I borrow that book from you? – OLIVER.KOO Feb 11 '18 at 22:28

1 Answers1

3

When you fit a glmnet model without specifying the lambda value, by default a range containing 100 lambda values is fit. When you call predict on such a model without specifying the lambda, the predictions are made for all lambda hence you receive 100 different predictions from a 100 different models.

Usually one runs cross validation to choose one lambda that is best and then predicts using it:

library(glmnet)
data(iris)

lets use 120 rows for training:

z <- sample(1:nrow(iris), 120)

now run a 5 - fold cross validation using miss classification error to chose the best lambda:

cv_fit <- cv.glmnet(as.matrix(iris[z,-5]),
                   iris[z,5],
                   nfolds = 5,
                   type.measure = "class",
                   alpha = 0,
                   grouped = FALSE,
                   family = "multinomial")

plot(cv_fit)

enter image description here

Here you can see the lambda.min corresponding to the dashed line on the left (lambda with lowest error in 5 fold cross validation) and lambda.1se (lambda with error of 1 se withing the lowest error near it on slightly on the right.

These values are in:

cv_fit$lambda.min
#[1] 0.05560455

cv_fit$lambda.1se
#[1] 0.09717054

Now when you know the best lambda you can either build a model on 100 lambda values:

fit <- glmnet(as.matrix(iris[z,-5]),
              iris[z, 5],
              alpha = 0,
              family = "multinomial")

and predict on a specific one:

predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")

or build a model on one lambda

fit1 <- glmnet(as.matrix(iris[z,-5]),
              iris[z, 5],
              alpha = 0,
              lambda = cv_fit$lambda.min,
              family = "multinomial")

and predict without specifying lambda:

all.equal(as.vector(predict(fit, as.matrix(iris[-z,-5]), s = cv_fit$lambda.min, type = "class")),
          as.vector(predict(fit1, as.matrix(iris[-z,-5]), type = "class")))

#TRUE

To see how much the coefficients were constrained you can plot the model and the lambda used:

plot(fit, xvar = "lambda")
abline(v = log(cv_fit$lambda.min), lty = 2)

enter image description here

missuse
  • 19,056
  • 3
  • 25
  • 47
  • This is a helpful answer, thank you! Do you happen to know what does `type=class` return when I set a `dfmax` value? The return value of `predict()` is still a matrix and the `ncols` seems to vary based on `dfmax`. Say I set `dfmax=10`; based on your answer it'd seem the columns correspond to all `lambda`s that resulted in the model having non-zero coefficients for 10 or less features. Do you think this is correct? – abhgh Feb 22 '19 at 06:10
  • Glad I could help. `dfmax` constrains the number of features so the regularization parameter can be at least as small to select at max the number of features specified in `dfmax`. Yes I think you are correct. That being said I would avoid using `dfmax` parameter if I was looking to optimize predictive accuracy. – missuse Feb 22 '19 at 06:51
  • Thank you! I need to constrain the `dfmax` value because I have a requirement to constrain the model to a fix number of terms. Although setting that value doesn't seem to always work (and I see another SO thread mentioning this too: https://stats.stackexchange.com/questions/114128/glmnet-in-r-difference-between-dfmax-and-pmax ). – abhgh Feb 24 '19 at 01:53
  • Based on that thread (and comment by Hong Ooi) it looks like `pmax` is the way to go in your case. When I look at the help file of [`glmnet`](https://www.rdocumentation.org/packages/glmnet/versions/2.0-16/topics/glmnet) this makes sense. – missuse Feb 24 '19 at 08:27
  • Thanks! I think I might post a new question regarding those parameters - I find them a little confusing in the case when `family='multinomial'`. – abhgh Feb 25 '19 at 19:03
  • @missuse Can I ask why you did not break down responses (5th column of iris) using useful::build.y() ? I tried your code and they worked, but in a book I read that for y in factor, user must break it into columns, one column for each level. – Steve Nov 05 '20 at 07:17
  • @Steve do you get different results when you input it as three column matrix and when you input it as a factor of three levels? They should be the same because in the [help of the function](https://www.rdocumentation.org/packages/glmnet/versions/4.0-2/topics/glmnet) it says: . `For family="multinomial", can be a nc>=2 level factor, or a matrix with nc columns of counts or proportions` – missuse Nov 05 '20 at 08:09
  • @missuse Yes, I tried your code and also input as three column matrix; they produce same result. I am self-study learner and is sometimes confused at different sources of knowledge. Here in R documentation I cannot understand what is "nc", then in book I read it says factor must be broken down. But your code is thwarting my book reading. – Steve Nov 06 '20 at 09:11
  • Probably the book was written before the feature was added to the package. This is a common problem since packages are ever changing. "nc" - I guess it means number of classes. All the best. – missuse Nov 06 '20 at 12:40