
I am trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have trouble really understanding the function, and I do not like using functions without knowing what is happening. I found several how-tos on this subject, but they only give the corresponding R code with no real explanation. This is the function itself:

## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)

Now, here is what I have trouble understanding:

    1) newdata  
    An optional data frame in which to look for variables 
    with which to predict. If omitted, the fitted values are used.
  • Everyone seems to use newdata for this, but I cannot quite understand why. To calculate the confidence interval I obviously need the data the interval is for (the number of observations, the mean of x, etc.), so that cannot be what is meant by it. But then: what does it mean?

    2) interval
    Type of interval calculation.

  • okay.. but what is "none" for?

    3a) type
    Type of prediction (response or model term).

    3b) terms
    If type="terms", which terms (default is all terms)

  • 3a: Can I use that to get the confidence interval for one specific variable in my model? And if so, what is 3b for? If I can specify the term in 3a, it would not make sense to do it again in 3b, so I guess I am wrong again, but I cannot figure out why.

I guess some of you might think: why not just try this out? I would (even if it might not solve everything here), but right now I don't know how. As I do not know what newdata is for, I don't know how to use it, and when I try, I do not get the right confidence interval. Somehow it matters a lot how you choose that data, but I just don't understand!

EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I am not sure whether it works the way I think it does: it calculates y-hat (the predicted values) and then adds/subtracts a margin to each of them to get the upr/lwr bounds of the interval, giving several data points (which then look like a confidence line)? Then I would understand why newdata needs to have the same length as the data in the linear model.

AdamO
lisa
  • The Details section of the documentation discusses the `newdata` argument at some length. What part of that discussion remains confusing? – joran Sep 22 '12 at 13:27
  • I guess this is what confuses me: "predict.lm produces predicted values, obtained by evaluating the regression function in the frame newdata" (in the general explanation) and "If newdata is omitted the predictions are based on the data used for the fit." for newdata. Why would I try to get confidence intervals which are not connected to my actual regression? This is how I understand this sentence, so this is what confuses me. Then it explains how missing values are handled in that case, but I struggle with the first part already! – lisa Sep 22 '12 at 13:50
  • You might be interested in _prediction_ intervals for new observations. – joran Sep 22 '12 at 14:00
  • Oh, alright! This helps a lot (+1). So in newdata I would put all the data points that I want to be predicted? So, not the ones that I already have, but the others? Or can I put both into that? So I'd get a line even where I do not have any data?! – lisa Sep 22 '12 at 14:06
  • Then it totally confuses me why newdata would have to have the same number of observations as the fitted model?! – lisa Sep 22 '12 at 14:08
  • `newdata` does *not* need to have the same number of observations as the fitted model ... – Ben Bolker Sep 22 '12 at 14:10
  • Okay I see, it does not necessarily have the same number, but I get a warning about that. Then, when I try to plot it (with lines()), it says x and y variables have diff length, even though the matrix I got by predict.lm() is 30x3 (even though my newdata has only 20) – lisa Sep 22 '12 at 14:22
  • I think we would need a reproducible example to see what's going wrong. See my answer below ... – Ben Bolker Sep 22 '12 at 14:33

1 Answer


Make up some data:

d <- data.frame(x=c(1,4,5,7),
                y=c(0.8,4.2,4.7,8))

Fit the model:

lm1 <- lm(y~x,data=d)

Confidence and prediction intervals with the original x values:

p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
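As a minimal sketch of what these calls return (reusing the same made-up data), each result is a matrix with one row per x value and the columns fit, lwr and upr. Both intervals are centred on the same fitted values; the prediction interval is simply wider:

```r
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

p_conf1 <- predict(lm1, interval = "confidence")
## R warns here that predictions on the current data refer to future responses
p_pred1 <- suppressWarnings(predict(lm1, interval = "prediction"))

colnames(p_conf1)   # the three columns: fit, lwr, upr
dim(p_conf1)        # one row per original observation

## same centre (fit), different widths
same_centre <- all.equal(p_conf1[, "fit"], p_pred1[, "fit"])
width_conf  <- p_conf1[, "upr"] - p_conf1[, "lwr"]
width_pred  <- p_pred1[, "upr"] - p_pred1[, "lwr"]
pred_is_wider <- all(width_pred > width_conf)
```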

Confidence and prediction intervals with new x values (extrapolating beyond the original data, and more finely/evenly spaced):

nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
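The number of rows in newdata is independent of the number of rows used for the fit; what matters is that newdata contains a column with the predictor's name. The sketch below illustrates this with a hypothetical three-row nd_small. It also shows one common way to end up with "the wrong number of rows": fitting with d$y ~ d$x instead of y ~ x with data=d. Whether that is what happened in the comments above is only a guess, since no reproducible example was posted.

```r
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

## hypothetical new x values: 3 rows, although the model was fitted on 4
nd_small <- data.frame(x = c(0, 2.5, 10))
p_small <- predict(lm1, newdata = nd_small, interval = "confidence")
nrow(p_small)   # one row per row of nd_small

## Pitfall: the model variables are now literally called "d$x" and "d$y",
## so predict() cannot find them in nd_small and falls back to the original d.
lm_bad <- lm(d$y ~ d$x)
## R warns that newdata had 3 rows but the variables found have 4 rows
p_bad <- suppressWarnings(
  predict(lm_bad, newdata = nd_small, interval = "confidence"))
nrow(p_bad)     # one row per row of d, not of nd_small
```

Fitting with a formula that uses bare column names plus a data argument avoids the second situation.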

Plotting everything together:

par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)

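[Figure: the data points and fitted line, with confidence bands (solid lines, "+" symbols) and prediction bands (dashed lines, open circles); red for the original x values, blue for the new x values.]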

Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
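To see what predict() does with the fitted values, here is a rough sketch that rebuilds both bands by hand from the standard errors returned with se.fit=TRUE. It assumes the default 95% level and an unweighted fit; the names tcrit, conf_manual and pred_manual are just for illustration.

```r
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)
nd <- data.frame(x = seq(0, 8, length.out = 51))

## fitted values plus their standard errors at each new x
p <- predict(lm1, newdata = nd, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)   # two-sided 95% t quantile

## confidence band: uncertainty of the fitted mean only
conf_manual <- cbind(fit = p$fit,
                     lwr = p$fit - tcrit * p$se.fit,
                     upr = p$fit + tcrit * p$se.fit)

## prediction band: adds the residual variance of a new observation
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)
pred_manual <- cbind(fit = p$fit,
                     lwr = p$fit - tcrit * se_pred,
                     upr = p$fit + tcrit * se_pred)

ok_conf <- all.equal(conf_manual,
                     predict(lm1, newdata = nd, interval = "confidence"),
                     check.attributes = FALSE)
ok_pred <- all.equal(pred_manual,
                     predict(lm1, newdata = nd, interval = "prediction"),
                     check.attributes = FALSE)
```

Because se.fit grows as x moves away from the mean of the original x values, the bands curve outward, which is why a fine grid in newdata draws them more faithfully than the few original points.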

I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
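A small sketch of the difference, using a hypothetical second predictor z (and made-up values) so that there is more than one term to choose from: confint() gives intervals for the coefficients, type="terms" returns each term's centred contribution to the prediction, and the terms argument selects which of those contributions to return.

```r
## hypothetical data with two predictors, for illustration only
d2 <- data.frame(x = c(1, 4, 5, 7, 9, 10),
                 z = c(2, 1, 4, 3, 6, 5),
                 y = c(0.8, 4.2, 4.7, 8, 9.5, 11.2))
lm2 <- lm(y ~ x + z, data = d2)

## intervals for the parameters: one row per coefficient
ci_par <- confint(lm2, level = 0.95)
rownames(ci_par)

## per-term contributions: one column per term (here x and z)
tt <- predict(lm2, type = "terms")
colnames(tt)

## the contributions plus the stored constant add up to the fitted values
recon <- rowSums(tt) + attr(tt, "constant")
adds_up <- all.equal(unname(recon), unname(fitted(lm2)))

## 'terms' restricts the output to the requested term(s)
tt_z <- predict(lm2, type = "terms", terms = "z")
colnames(tt_z)
```

So type picks the kind of output, while terms only narrows it down when type="terms".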

interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
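For illustration, a short sketch with the same made-up data: with the default interval="none" the result is a plain numeric vector of predictions, which for the original data matches fitted().

```r
d <- data.frame(x = c(1, 4, 5, 7),
                y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)

p_none <- predict(lm1)   # same as predict(lm1, interval = "none")
is.matrix(p_none)        # a vector, not a fit/lwr/upr matrix
matches_fitted <- all.equal(p_none, fitted(lm1))
```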

Ben Bolker
  • Can someone maybe try to explain the difference between a prediction and a confidence interval in a few words? This is how I understood it: the CI gives you a clue about where the mean of the population is likely to fall 95% of the time. The PI, on the other hand, is not about the mean but about future values, that is, y values which are not in your data yet. Is this somehow correct? – lisa Sep 23 '12 at 13:46
  • May I suggest that you google '"prediction interval" "confidence interval"' ... ? The answers are out there ... if you don't get what you need there, then you should probably ask on http://stats.stackexchange.com , as we have gotten beyond the realm of programming ... also: http://stackoverflow.com/questions/9406139/r-programming-predict-prediction-vs-confidence – Ben Bolker Sep 23 '12 at 14:03