2

I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:

    v   n river_dist
1 102 256       1040
2   1  11        720
3  19  24        475
4  12  15        611

Which I fit to this model:

library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
    start = list(a = 0, br = 0), data = ig)

This produces a coefficient which, when back-transformed, suggests about 0.4% decrease in the likelihood of igneous stone per meter from the river (br = 0.996):

exp(coef(m_r))

That's all good. But when I try to predict new values, I get this odd cycling of values:

newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)

Example of predicted values:

   river_dist          v
1     475.0000 216.855114
2     480.7071   9.285536
3     486.4141  20.187424
4     492.1212  12.571487
5     497.8283 213.762248
6     503.5354   9.150584
7     509.2424  19.888471
8     514.9495  12.381805
9     520.6566 210.476312
10    526.3636   9.007289
11    532.0707  19.571218
12    537.7778  12.180629

Why are the values cycling up and down like that, creating crazy spikes when graphed?

user20650
  • 24,654
  • 5
  • 56
  • 91
  • 1
    it seems as if the predictions are cycling in steps of four as you pass four rows of data -- so it is cycling the n perhaps. For predictions you can do `plogis(tcrossprod(coef(m_r), cbind(1, newdat$river_dist)))` but that doesn't answer you question. – user20650 May 25 '20 at 17:15
  • 1
    ... so try using `newdat$n = 1` . ps you dont need to use `ig$` as you are using the `data=` i.e. Use `mle2(v ~ dbinom(size=n, ... ` – user20650 May 25 '20 at 17:18
  • 1
    @user20650, didn't mean to steal this. (For comments that are this clear, you might as well post as an answer ...) – Ben Bolker May 25 '20 at 17:30

1 Answers1

2

In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:

library(bbmle)
m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
    start = list(a = 0, br = 0), data = ig)

At this point, as @user20650 suggests, you'll also have to specify a value (or values) for n in newdata.

This model appears to be identical to binomial regression: is there a reason not to use

glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial) 

? (bbmle:mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)

Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.

m_r <- mle2(v ~ dbinom(size=n,
        prob = 1/(1+exp(-(a + br * river_dist)))),
            start = list(a = 0, br = 0), data = ig,
            method="Nelder-Mead")
pframe <- data.frame(river_dist=seq(500,1000,length=51),n=1)
pframe$prop <- predict(m_r, newdata=pframe, type="response")
CIs <- lapply(seq(nrow(ig)),
              function(i) prop.test(ig[i,"v"],ig[i,"n"])$conf.int)
ig2 <- data.frame(ig,setNames(as.data.frame(do.call(rbind,CIs)),
              c("lwr","upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2,aes(river_dist,v/n))+
    geom_point(aes(size=n)) +
    geom_linerange(aes(ymin=lwr,ymax=upr)) +
    geom_smooth(method="glm",
                method.args=list(family=binomial),
              aes(weight=n))+
    geom_line(data=pframe,aes(y=prop),colour="red")

enter image description here

Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • how do you decide on the `parscale` values please? Is this done after fitting the model once, then scaling on a second run or? – user20650 May 25 '20 at 18:28
  • 1
    you could run and rescale, or just set `parscale` to approximately `1/` – Ben Bolker May 25 '20 at 18:36
  • Wow, this is awesome! Thank you! I suspect my original instructor in R stats taught us mle2 so we'd be forced to think about our starting values explicitly; it's great to know about the glm function for the future. And that graph is gorgeous! You're a lifesaver. – Lauren Pratt May 26 '20 at 19:44
  • Side-note follow up: Is there a benefit to running the prediction from 500-1000 m, rather than the real min and max in my data? (475 and 1040 m?) – Lauren Pratt May 26 '20 at 20:33
  • matter of taste/preference, I think. – Ben Bolker May 26 '20 at 22:54