2

Below is a set of fictitious probability data, which I converted into binomial with a threshold of 0.5. I ran a glm() model on the discrete data to test if the intervals returned from glm() were 'mean prediction intervals' ("Confidence Interval") or 'point prediction intervals'("Prediction Interval"). It appears from the plot below that the returned intervals are the latter--'Point Prediction Intervals'; note, with 95% confidence, 2/20 points fall outside of the line in this sample.

If this is indeed the case, how do I generate the the 'mean prediction interval' (i.e, "Confidence Intervals") in R for a binomial data set bound by 0 and 1 using glm()? Please show your code and plot similar to mine with the fit line, given probabilities, 'confidence intervals' and 'prediction intervals'.

# Fictitious data
xVal <- c(15,15,17,18,32,33,41,42,47,50,
         53,55,62,63,64,65,66,68,70,79,
         94,94,94,95,98)
randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
               .2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
               .98,.88,.95,.95,.95)
# Converted to binomial
randBinom <- ifelse(randRatio < .5, 0, 1)

# Data frame for model
binomData <- data.frame(
  randBinom = randBinom,
  xVal = xVal
)

# Model
mode1 <- glm(randBinom~ xVal, data = binomData, family = binomial(link = "logit"))

# Predict all points in xVal range
frame <- data.frame(xVal=(0:100))
predAll <- predict(mode1, newdata = frame,type = "link", se.fit=TRUE)

# Params for intervals and plot
confidence <- .95
score <- qnorm((confidence / 2) + .5)
frame <- data.frame(xVal=(0:100))

#Plot
with(binomData, plot(xVal, randBinom, type="n", ylim=c(0, 1), 
                 ylab = "Probability", xlab="xVal"))
lines(frame$xVal, plogis(predAll$fit), col = "red", lty = 1)
lines(frame$xVal, plogis(predAll$fit + score * predAll$se.fit), col = "red", lty = 3)
lines(frame$xVal, plogis(predAll$fit - score * predAll$se.fit), col = "red", lty = 3)
points(xVal, randRatio, col = "red") # Original probabilities
points(xVal, randBinom, col = "black", lwd = 3) # Binomial Points used in glm

Here's the plot, presumably with 'point prediction intervals' (i.e., "Prediction Intervals") in dashed red, and the mean fit in solid red. Black dots represent the discrete binomial data from original probabilities in randRatio:

enter image description here

Bob Hopez
  • 773
  • 4
  • 10
  • 28
  • I think your premise is incorrect. I think you are not seeing what you are calling "point prediction intervals" and what most people simply call "prediction intervals". What you are calling 'mean prediction intervals' is (probably) what most people would call "confidence intervals" and those apply to plausible locations of the estimated parameter. – IRTFM Aug 05 '16 at 21:46
  • @42- I edited some of the wording to align better with your comment. – Bob Hopez Aug 05 '16 at 23:38
  • @ZheyuanLi Please see modified question. I'm interested to see your solution, and even more so, if there is a way using glm(). Using predict() on lm() with "confidence" or "prediction" doesn't seem to be an option with glm(). See: http://stackoverflow.com/questions/12544090/predict-lm-in-r-how-to-get-nonconstant-prediction-bands-around-fitted-values – Bob Hopez Aug 05 '16 at 23:54
  • Using type = `link` gives you confidence intervals (on the logit scale). You are presenting them on the probability scale but they are still not prediction intervals. – IRTFM Aug 06 '16 at 00:06
  • @42- The above uses `type = link`. I then hear you saying the above plot represents the confidence interval. Is there a type argument for prediction intervals in glm? It appears more like a prediction interval to me, given the data--and that it covers ~95% of the data (red dots). – Bob Hopez Aug 06 '16 at 00:13
  • 1
    Think about it a bit. A "prediction" for an "Y" value in a binomial situation would need to be either 1 or 0. None of the `predict.glm` values are either of those numbers. – IRTFM Aug 06 '16 at 00:37
  • `predict.glm` has no provision for `interval ="prediction"`. If you wnat to see more from R-Core guRus consider doing a bit of searching: http://markmail.org/search/?q=list%3Aorg.r-project.r-help+prediction+intervals+for+glm+#query:list%3Aorg.r-project.r-help%20prediction%20intervals%20for%20glm%20+page:4+mid:mihtbvrv656mp6nl+state:results – IRTFM Aug 06 '16 at 00:43
  • @ZheyuanLi See bethany's comments below and link to generating the prediction interval. I'm not sure if her suggested method would work for glm binomial problem. Perhaps you're willing to attack the problem using her suggeseted approach. – Bob Hopez Aug 06 '16 at 17:22

1 Answers1

2

I am not sure if you are asking for the straight up prediction interval, but if you are you can calculate it simply.

You can extract a traditional confidence interval for the model as such:

confint(model)

And then once you run a prediction, you can calculate a prediction interval based on the prediction like so:

upper = predAll$fit + 1.96 * predAll$se.fit
lower = predAll$fit - 1.96 * predAll$se.fit

You are simply taking the prediction (at any given point if you use a single set of predictor variables) and adding and subtracting 1.96 * absolute value of the standard error. (1.96 se includes 97.5% of the normal distribution and represents the 95% interval as it does for the standard deviation in the normal distribution)

This is the same formula that you would use for a traditional confidence interval except that using the standard error (as opposed to the standard deviation) makes the interval wider to account for the uncertainty in prediction itself.

Update:

Method for plotting prediction invervals courtesy of Rstudio!

As requested...though not done by me!

Bob Hopez
  • 773
  • 4
  • 10
  • 28
sconfluentus
  • 4,693
  • 1
  • 21
  • 40
  • Thanks for your approach. I would kindly challenge you to create a plot with both the 'confidence interval' and 'prediction interval' along with the complete code. – Bob Hopez Aug 06 '16 at 14:02
  • Why reinvent the wheel...here is a solid concise and brainy way of doing this with ggplot2: – sconfluentus Aug 06 '16 at 15:49
  • And those can be used with GLM as well. – sconfluentus Aug 06 '16 at 15:54
  • Thanks; link was broken, but found [here](http://rstudio-pubs-static.s3.amazonaws.com/7024_e3a68a9b35454e74abfe15b621c50502.html). I'm not convinced that SE and STDEV calcs used in linear regression can be applied the same way in logistic regression. Challenge still stands. :) – Bob Hopez Aug 06 '16 at 17:06
  • Will try it... Or if someone, yourself included, wants to post the answer herein; I'll give them an upvote. – Bob Hopez Aug 06 '16 at 17:10
  • I will leave that test to you. I have used this method prior and it works. – sconfluentus Aug 06 '16 at 17:49
  • Did you use the binary residuals or the ratio ('probability') residuals? If the sigmoid changes at the beginning or end of the x-range then it will generate lots of skew with either residuals. – Bob Hopez Aug 06 '16 at 18:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120319/discussion-between-bob-hopez-and-bethanyp). – Bob Hopez Aug 06 '16 at 18:11