10

I'm running an elastic net on a generalized linear model with the glmnet and caret packages in R.

My response variable is cost (where cost > $0) and hence I'd like to specify a Gaussian family with a log link for my GLM. However glmnet doesn't seem to allow me to specify (link="log") as follows:

> lasso_fit <- glmnet(x, y, alpha=1, family="gaussian"(link="log"), lambda.min.ratio=.001)

I've tried different variants, with and without quotations, but no luck. The glmnet documentation doesn't discuss how to include a log link.

Am I missing something? Does family="gaussian" already implicitly assume a log link?

RobertF
  • 824
  • 2
  • 14
  • 40
  • 1
    I think this could be difficult. If you dig into the `glmnet` code you'll see that `glmnet(..., family="gaussian")` calls `elnet` , which calls the Fortran `spelnet` function. (Poisson regression calls `fishnet`, which calls `fishnet` or `spfishnet` (for dense vs sparse model matrices.) So I suspect someone would have to start from scratch to write a variant of elastic net that handled the log linkage ... – Ben Bolker Aug 08 '14 at 20:14

2 Answers2

5

It is a bit confusing, but the family argument in glmnet and glm are quite different. In glm, you can specify a character like "gaussian", or you can specify a function with some arguments, like gaussian(link="log"). In glmnet, you can only specify the family with a character, like "gaussian", and there is no way to automatically set the link through that argument.

The default link for gaussian is the identity function, that is, no transformation at all. But, remember that a link function is just a transformation of your y variable; you can just specify it yourself:

glmnet(x, log(y), family="gaussian")

Also note that the default link for the poisson family is log, but the objective function will change. See the Details under ?glmnet in the first couple of paragraphs.


Your comments have led me to rethink my answer; I have evidence that it is not correct.

As you point out, there is a difference between E[log(Y)] and log(E[Y]). I think what the above code does is to fit E[log(Y)], which is not what you want. Here is some code to generate data and confirm what you noted in the comments:

# Generate data
set.seed(1)
x <- replicate(3,runif(1000))
y <- exp(2*x[,1] + 3*x[,2] + x[,3] + runif(1000))
df <- data.frame(y,x)

# Run the model you *want*
glm(y~., family=gaussian(link="log"), data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4977746   2.0449443   3.0812333   0.9451073 

# Run the model you *don't want* (in two ways)    
glm(log(y)~., family=gaussian(link='identity'), data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4726745   2.0395798   3.0167274   0.9957110 
lm(log(y)~.,data=df)$coef
# (Intercept)          X1          X2          X3 
#   0.4726745   2.0395798   3.0167274   0.9957110 

# Run the glmnet code that I suggested - getting what you *don't want*.
library(glmnet)
glmnet.model <- glmnet(x,log(y),family="gaussian", thresh=1e-8, lambda=0)
c(glmnet.model$a0, glmnet.model$beta[,1])
#        s0        V1        V2        V3 
# 0.4726745 2.0395798 3.0167274 0.9957110 
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • Great, thank you. If I specify glmnet(x, log(y), family="gaussian") as you suggested, glmnet will actually be estimating log[E(y)] (which I want) rather than E[log(y)] (a loglinear OLS, which I don't want), is that correct? – RobertF Aug 08 '14 at 16:39
  • 4
    To be honest, I didn't realize that log[E(y)] wasn't exactly equivalent to E[log(y)] until I looked it up five minutes ago. I ran some code to test that `glm(y~x,family=gaussian(link="log"))` was equivalent to `glm(log(y)~x,family=gaussian)`, and it was, so maybe I am missing something fundamental here. – nograpes Aug 08 '14 at 17:04
  • Hmm, I double-checked on my own data, and I'm seeing different regression coefficients for glm(log(y)~x,family=gaussian) vs. glm(y~x,family=gaussian(link="log")), while lm(log(y)~x) and glm(log(y)~x,family=gaussian) produce identical coefficients. – RobertF Aug 08 '14 at 17:58
  • You might have given me that check a little too soon. I reran some code, and realized that my answer is likely incorrect. I'll leave it up until a better answer comes along. – nograpes Aug 08 '14 at 19:29
  • 1
    No worries - I'll check with the glmnet authors. I can't believe they overlooked this since GLMs on a Gaussian family with a log link is fairly common with healthcare cost data. – RobertF Aug 08 '14 at 19:41
  • I can certainly believe they overlooked this. It's not a trivial change, but probably one that requires a bit of rethinking of the underlying code/algorithm -- hopefully not too hard, but not something that they just forgot to include. – Ben Bolker Aug 08 '14 at 20:39
  • @BenBolker - I emailed Trevor Hastie, maybe I'm missing something. Otherwise it does sound like an interesting and challenging programming problem. :) – RobertF Aug 08 '14 at 20:51
  • @RobertF Did you ever hear back? – RNs_Ghost Jul 28 '17 at 12:46
  • 1
    @RNs_Ghost No - reading through nograpes's response again, the only other possibility is that using the Poisson link in glmnet is often used as an "equivalent" to a Gaussian family with a log link. However if you run the following code on nograpes toy dataset: `glmnet.model <- glmnet(x,y,family="poisson", thresh=1e-8, lambda=0); c(glmnet.model$a0, glmnet.model$beta[,1])`, the results still don't quite match the `glm()` model. – RobertF Jul 28 '17 at 14:19
2

I know that this is an old question, but in the current version of (4.0-2) it is possible to use glm family functions as arguments to "family" instead of a character string, so you could use:

glmnet(x, y, family=gaussian(link="log"))

Note that the package is faster when you use the string arguments.

Reference: https://glmnet.stanford.edu/articles/glmnetFamily.html

Let's try
  • 1,044
  • 9
  • 20
Andre Barbosa
  • 142
  • 1
  • 8