
Given sample data of proportions of successes, sample sizes, and independent variable(s), I am attempting logistic regression in R.

The following code does what I want and seems to give sensible results, but it does not look like a sensible approach; in effect it doubles the size of the data set:

datf <- data.frame(prop  = c(0.125, 0,  0.667, 1,  0.9),
                   cases = c(8,     1,  3,     3,  10),
                   x     = c(11,    12, 15,    16, 18))

datf2         <- rbind(datf,datf)
datf2$success <- rep(c(1, 0), each=nrow(datf))
datf2$cases   <- round(datf2$cases*ifelse(datf2$success,datf2$prop,1-datf2$prop))
fit2          <- glm(success ~ x, weight=cases, data=datf2, family="binomial")

datf$proppredicted    <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type="l", col="red", ylim=c(0,1))
points(datf$x, datf$prop, cex=sqrt(datf$cases))

producing a chart like this [image: logistic prediction]

which looks reasonably sensible.

But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?

As a lesser question, is there a cleaner way of calculating the predicted proportions?

Henry
    Close voters: this is a question about how to use `glm`, not about the statistical qualities of the model. – Hong Ooi Apr 28 '18 at 17:09
  • Weights should be the number of trials, not the number of successes. – Slouei Apr 22 '20 at 16:00
  • @Slouei `weight=cases` is both the number of successes (when `success==1`) and the number of non-successes (when `success==0`) so in total is all the trials – Henry Apr 22 '20 at 20:03

1 Answer


No need to construct artificial data like that; `glm` can fit your model from the dataset as given.

> glm(prop ~ x, family=binomial, data=datf, weights=cases)

Call:  glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)

Coefficients:
(Intercept)            x  
    -9.3533       0.6714  

Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
Null Deviance:      17.3 
Residual Deviance: 2.043    AIC: 11.43

You will get a warning about "non-integer #successes", but that is because `glm` is being silly. Compare to the model fitted on your constructed dataset:

> fit2

Call:  glm(formula = success ~ x, family = "binomial", data = datf2, 
    weights = cases)

Coefficients:
(Intercept)            x  
    -9.3532       0.6713  

Degrees of Freedom: 7 Total (i.e. Null);  6 Residual
Null Deviance:      33.65 
Residual Deviance: 18.39    AIC: 22.39

The regression coefficients (and therefore predicted values) are basically equal. However your residual deviance and AIC are suspect because you've created artificial data points.
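As an aside, a standard way to silence that warning entirely (a sketch, assuming the underlying counts of successes are whole numbers, so that `round()` recovers them) is to pass a two-column matrix of successes and failures as the response. And `predict(..., type = "response")` gives the fitted proportions directly, without hand-coding the inverse logit:

```r
# Two-column response: counts of successes and failures per row.
# Assumes prop * cases rounds to the true integer success counts.
succ <- round(datf$prop * datf$cases)
fail <- datf$cases - succ
fit3 <- glm(cbind(succ, fail) ~ x, family = binomial, data = datf)

# Cleaner predicted proportions on the original data:
datf$proppredicted <- predict(fit3, newdata = datf, type = "response")
```

This fits the same model as the weighted-proportion call, but with integer counts, so `glm` has nothing to complain about.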

Hong Ooi
  • That is helpful - I had been worried about the `Warning message: In eval(family$initialize) : non-integer #successes in a binomial glm!` and had tried to avoid it with the rounding. As for the residual deviance and AIC, I suspect both methods may be misleading: if I had the original 25 observations and used `glm` unweighted, I would get very different numbers – Henry Apr 28 '18 at 19:41
  • why is `weights=cases` here? – stats_noob Nov 24 '22 at 03:58