
I am trying to fit a model to my data, which has a dependent variable that can be 0 or 1.

I tried to fit a binomial glmer to the data, but the fit is pretty bad, as you can see below. This puzzles me because the data look quite sigmoidal, so I thought I would get a great fit with that kind of model. Am I using the wrong model? (Color is my data, black is the fit.)

[plot: data (colored) with poor logistic fit (black)]

Here is the code I used in R:

library(lme4)
library(ggplot2)

exdata <- read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")

model <- glmer(VD ~ as.factor(VI2)*VI1 + (1|ID), exdata,
               family = binomial(link = "logit"),
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5)))
summary(model)
exdata$fit <- predict(model, type = "response")

ggplot(exdata, aes(VI1, VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(geom = "line", size = 0.8) +
  stat_summary(aes(y = fit), geom = "line", size = 0.8, color = "black") +
  theme_bw()

I also tried it without the random effect to see if that would change anything, but it didn't:

ggplot(exdata, aes(x = VI1, y = VD, color = as.factor(VI2),
                   group = as.factor(VI2))) +
  stat_summary(fun.data = mean_se, geom = "line", size = 1) +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family = binomial), color = "black")

Here is the data: https://github.com/FlorianLeprevost/dummydata/blob/main/exdata.csv

FloLe

1 Answer


tl;dr I don't think these data are as sigmoidal as you think. In particular, a logistic regression estimates a sigmoid curve that ranges from 0 to 1, whereas yours levels out (sort of) at 0.9. In much more detail:
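To see what I mean: the inverse link of a binomial GLM is the standard logistic function, which always approaches 1 (not 0.9) as the linear predictor grows, so the fitted curve literally cannot level off below 1. A quick base-R illustration (nothing here depends on your data):

curve(plogis(x), from = -10, to = 10)  ## p = 1/(1 + exp(-x))
plogis(c(-10, 0, 10))                  ## ~0, 0.5, ~1: the upper asymptote is exactly 1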

slightly streamlined data import/model fitting

library(lme4)
library(ggplot2)
library(dplyr)

exdata <- (read.csv("https://raw.githubusercontent.com/FlorianLeprevost/dummydata/main/exdata.csv")
    |> mutate(across(VI2, factor))
)
model <- glmer(VD ~ VI2*VI1 + (1|ID),
               exdata,
               family = binomial(link = "logit"))

compute data summary and predictions

This can also be done with stat_summary(), but I like the finer control of doing it myself. In particular, I like to get Clopper-Pearson CIs on the proportions (could also do this with prop.test() to get score-test CIs). I'm also computing predictions across a wider VI1-range than the data (see why below).

ddsum <- (exdata
    |> group_by(VI1, VI2)
    |> summarise(
           num = n(),
           pos = sum(VD),
           VD = mean(VD),
           lwr = binom.test(pos, num)$conf.int[1],
           upr = binom.test(pos, num)$conf.int[2],
           .groups = "drop")
)
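If you prefer the score-test CIs mentioned above, here's a sketch of the same summary with prop.test() swapped in for binom.test() (it builds on the exdata object from above; ddsum_score is just an illustrative name):

## same grouped summary, but with score-test CIs from prop.test()
ddsum_score <- (exdata
    |> group_by(VI1, VI2)
    |> summarise(
           num = n(),
           pos = sum(VD),
           VD = mean(VD),
           lwr = prop.test(pos, num)$conf.int[1],
           upr = prop.test(pos, num)$conf.int[2],
           .groups = "drop")
)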

pframe <- expand.grid(
    VI1 = seq(-50, 50, length = 101),
    VI2 = unique(exdata$VI2))
pframe$VD <- predict(model, newdata = pframe, re.form = ~0, type = "response")

plot

gg0 <- ggplot(ddsum,aes(x=VI1, y=VD,  color=VI2)) + 
    geom_pointrange(position = position_dodge(width = 0.3),
                    aes(ymin = lwr, ymax = upr, size = num), alpha = 0.5) +
    scale_size_area(max_size = 4) +
    theme_bw()
gg1 <- gg0 + geom_line(data = pframe)
ggsave(gg1, file = "gglogist1.png")

[plot: observed proportions (point ranges sized by n) with fitted logistic curves]

Conclusion: the sharp increase from x=0 to x=15 combined with the saturation below 1.0 makes it hard to fit with a logistic curve.

We could try a quadratic-logistic fit:

model2 <- update(model, . ~ poly(VI1,2)*VI2 + (1|ID))
pframe$VD2 <- predict(model2, newdata = pframe, re.form = ~0, type = "response")
gg2 <- gg1 + geom_line(data=pframe, aes(y=VD2), linetype = 2)
ggsave(gg2, file = "gglogist2.png")

[plot: quadratic-logistic fits added as dashed lines]

This fits better (it couldn't fit worse), but might not make sense for your application.

In principle we could fit a logistic that saturated at a value <1, but it's a bit tricky ...
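For example, here's a minimal sketch of one way to do it by direct maximum likelihood, ignoring the random effect and the VI2 grouping for simplicity (the parameter names and starting values are made up for illustration; the asymptote is estimated on the logit scale so it stays between 0 and 1):

library(bbmle)

## scaled logistic: p = A * plogis(b0 + b1*VI1), with asymptote A = plogis(logitA) < 1
m3 <- mle2(VD ~ dbinom(size = 1,
                       prob = plogis(logitA) * plogis(b0 + b1 * VI1)),
           start = list(logitA = 2, b0 = -2, b1 = 0.3),
           data = exdata)
summary(m3)
plogis(coef(m3)["logitA"])  ## estimated upper asymptote

Extending this to the VI2 groups and the by-ID random effect is where it gets tricky.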

Ben Bolker
  • Thanks a lot for your answer, it is clear and helpful as always! Indeed, a quadratic fit doesn't really make sense for the data (basically my DV is whether the fastest route was chosen or not). Do you know of any resource on fitting a logistic that doesn't saturate at 1? – FloLe Nov 18 '22 at 08:59