5

I'm trying to fit a logistic regression using glm( family='binomial').

Here is the model:

model<-glm(f_ocur~altitud+UTM_X+UTM_Y+j_sin+j_cos+temp_res+pp, 
           offset=(log(1/off)), data=mydata, family='binomial')

mydata has 76820 observations. The response variable (f_ocur) is 0-1.
This data is a sample of a bigger dataset, so the idea of setting the offset is to account that the data used here represents a sample of the real data to be analysed.

For some reason the offset is not working. When I run this model I get a result, but when I run the same model but without the offset I get the exact same result as the previous model. I was expecting a different result but there is no difference.

Am I doing something wrong? Should the offset be with the linear predictors? like this:

model <- glm(f_ocur~altitud+UTM_X+UTM_Y+j_sin+j_cos+temp_res+pp+offset(log(1/off)), 
             data=mydata, family='binomial')

Once the model is ready, I´d like to use it with new data. The new data would be the data to validate this model, this data has the same columns. My idea is to use:

validate <- predict(model, newdata=data2, type='response')

And here comes my question, does the predict function takes into consideration the offset used to create the model? If not, what should I do in order to get the correct probabilities for the new data?

rcs
  • 67,191
  • 22
  • 172
  • 153
lpchaparro
  • 129
  • 2
  • 6
  • I think what you want are really *weights* rather than *offsets*, but I don't think that you can use them the way you want in `glm`, which always normalizes the sum of the weights to unity. You might be better off simply scaling all your standard errors by `sqrt(sample_fraction)`, where `sample_fraction` is the fraction of your data set you are taking. – Ben Bolker Nov 06 '12 at 03:03
  • Thank you fo your answer. I´ve been reading about weights but that is not what i´m looking for because I don´t think adding more trials would be helpful to consider the "complete data". I might be missinterpreting what the wights are, if so, please correct me. – lpchaparro Nov 06 '12 at 13:49
  • @ben-bolker I've posted an answer drawn from your Rpub - is this on the right track? – Nathan Dec 16 '15 at 03:45

2 Answers2

5

I think the log offset is used with Poisson family. In case of binomial you should not use log. Check the link https://stats.stackexchange.com/questions/25415/using-offset-in-binomial-model-to-account-for-increased-numbers-of-patients

Community
  • 1
  • 1
Aybek Khodiev
  • 596
  • 1
  • 4
  • 10
3

Looking at your question I'm guessing your primary question is about why the offset isn't making any difference.

Stealing a sentence from a @Ben Bolker Rpub (https://rpubs.com/bbolker/logregexp): "A very common situation in ecology (and elsewhere) is a survival/binary-outcome model where individuals (each measured once) differ in their exposure. The classical approach to this problem is to use a complementary log-log link."

So on that basis I would suggest that the code you are looking for maybe:

model <- glm(f_ocur~altitud+UTM_X+UTM_Y+j_sin+j_cos+temp_res+pp, 
       data=mydata, family = binomial(link = cloglog),offset=log(1/off))

Below is a little example which shows that not only are the results different with and without the offset but using AICc model selection, the better model is ranked higher despite "time" being confounded with "site".

library(AICcmodavg)

set.seed(1)
time <- c(rep(1,50),rep(2,50))
site <- c(rep("site 1",50),rep("site 2",50))
surv <- c(rbinom(50,1,prob=0.7),rbinom(50,1,prob=0.7^2))

my.data <- data.frame(surv, site, time)

# setup AICc model list
Cand.models <- list( )

Cand.models[[1]] <- glm(surv ~ 1, data=my.data, family = binomial(link = cloglog), offset=log(1/time))
Cand.models[[2]] <- glm(surv ~ 1, data=my.data, family = binomial(link = cloglog))
Cand.models[[3]] <- glm(surv ~ site , data=my.data, family = binomial(link = cloglog), offset=log(1/time))

# create a vector of names to trace back models in set
Modnames <- paste("mod", 1:length(Cand.models), sep = " ")

# generate AICc table
aictab(cand.set = Cand.models, modnames = Modnames, sort = TRUE)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Nathan
  • 646
  • 5
  • 10
  • this seems reasonable. I still don't understand the OP's comment that the offset makes **no** difference; that seems surprising, and there is no reproducible example given ... – Ben Bolker Dec 16 '15 at 14:50