Why are the predicted values of my GLM cyclical?

Question

I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:

    v   n river_dist
1 102 256       1040
2   1  11        720
3  19  24        475
4  12  15        611

Which I fit to this model:

library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
    start = list(a = 0, br = 0), data = ig)

This produces a coefficient which, when back-transformed, suggests about 0.4% decrease in the likelihood of igneous stone per meter from the river (br = 0.996):

exp(coef(m_r))

That's all good. But when I try to predict new values, I get this odd cycling of values:

newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)

Example of predicted values:

   river_dist          v
1     475.0000 216.855114
2     480.7071   9.285536
3     486.4141  20.187424
4     492.1212  12.571487
5     497.8283 213.762248
6     503.5354   9.150584
7     509.2424  19.888471
8     514.9495  12.381805
9     520.6566 210.476312
10    526.3636   9.007289
11    532.0707  19.571218
12    537.7778  12.180629

Why are the values cycling up and down like that, creating crazy spikes when graphed?

it seems as if the predictions are cycling in steps of four as you pass four rows of data -- so it is cycling the n perhaps. For predictions you can do `plogis(tcrossprod(coef(m_r), cbind(1, newdat$river_dist)))` but that doesn't answer you question. — user20650, May 25 '20 at 17:15
... so try using `newdat$n = 1` . ps you dont need to use `ig$` as you are using the `data=` i.e. Use `mle2(v ~ dbinom(size=n, ... ` — user20650, May 25 '20 at 17:18
@user20650, didn't mean to steal this. (For comments that are this clear, you might as well post as an answer ...) — Ben Bolker, May 25 '20 at 17:30

Ben Bolker · Accepted Answer · 2020-05-25T18:12:27.520

In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:

library(bbmle)
m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
    start = list(a = 0, br = 0), data = ig)

At this point, as @user20650 suggests, you'll also have to specify a value (or values) for n in newdata.

This model appears to be identical to binomial regression: is there a reason not to use

glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial)

? (bbmle:mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)

Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.

m_r <- mle2(v ~ dbinom(size=n,
        prob = 1/(1+exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig,
            method="Nelder-Mead")
pframe <- data.frame(river_dist=seq(500,1000,length=51),n=1)
pframe$prop <- predict(m_r, newdata=pframe, type="response")
CIs <- lapply(seq(nrow(ig)),
              function(i) prop.test(ig[i,"v"],ig[i,"n"])$conf.int)
ig2 <- data.frame(ig,setNames(as.data.frame(do.call(rbind,CIs)),
              c("lwr","upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2,aes(river_dist,v/n))+
    geom_point(aes(size=n)) +
    geom_linerange(aes(ymin=lwr,ymax=upr)) +
    geom_smooth(method="glm",
                method.args=list(family=binomial),
              aes(weight=n))+
    geom_line(data=pframe,aes(y=prop),colour="red")

Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).

how do you decide on the `parscale` values please? Is this done after fitting the model once, then scaling on a second run or? — user20650, May 25 '20 at 18:28
you could run and rescale, or just set `parscale` to approximately `1/` — Ben Bolker, May 25 '20 at 18:36
Wow, this is awesome! Thank you! I suspect my original instructor in R stats taught us mle2 so we'd be forced to think about our starting values explicitly; it's great to know about the glm function for the future. And that graph is gorgeous! You're a lifesaver. — Lauren Pratt, May 26 '20 at 19:44
Side-note follow up: Is there a benefit to running the prediction from 500-1000 m, rather than the real min and max in my data? (475 and 1040 m?) — Lauren Pratt, May 26 '20 at 20:33

Why are the predicted values of my GLM cyclical?

1 Answers1