How do you predict outcomes from a new dataset using a model created from a different dataset in R?

Question

I could be missing something about prediction -- but my multiple linear regression is seemingly working as expected:

> bigmodel <- lm(score ~ lean + gender + age, data = mydata)
> summary(bigmodel)

Call:
lm(formula = score ~ lean + gender + age, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.891  -4.354   0.892   6.240  18.537 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 70.96455    3.85275  18.419   <2e-16 ***
lean         0.62463    0.05938  10.518   <2e-16 ***
genderM     -2.24025    1.40362  -1.596   0.1121    
age          0.10783    0.06052   1.782   0.0764 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9 on 195 degrees of freedom
Multiple R-squared:  0.4188,    Adjusted R-squared:  0.4098 
F-statistic: 46.83 on 3 and 195 DF,  p-value: < 2.2e-16

> head(predict(bigmodel),20)
       1        2        3        4        5        6        7        8        9       10 
75.36711 74.43743 77.02533 78.76903 79.95515 79.09251 80.38647 81.65807 80.14846 78.96234 
      11       12       13       14       15       16       17       18       19       20 
82.39052 82.04468 81.05187 81.26753 84.50240 81.80667 80.92169 82.40895 81.76197 82.94809

But I can't wrap my head around the prediction after reading ?predict.lm. This output looks good to me for my original dataset -- but what if I want to run the prediction against a different dataset than the one I used to create bigmodel?

For example, if I import a .csv file into R called newmodel with 200 people complete with leans, gender, and age -- how can I use the regression formula from bigmodel to produce predictions for newmodel?

Thanks!

Ramnath · Accepted Answer · 2014-04-18T00:39:40.420

3

The documentation for predict.lm, excerpted below, describes a newdata argument. Pass the data frame you imported as newmodel to that argument to get predictions for it.

predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)
Arguments

object  
Object of class inheriting from "lm"

newdata 
An optional data frame in which to look for variables with which to predict. 
If omitted, the fitted values are used.
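
As a rough illustration of what `newdata` needs to contain, here is a minimal sketch using simulated data shaped like the question's variables. The data values, the sample size, and the helper names `needed` and `missing_cols` are assumptions made for this example; only `bigmodel`, `mydata`, `newmodel` and the formula come from the question.

```r
# Simulated stand-ins for the question's data; all values are made up.
set.seed(1)
n <- 60
mydata <- data.frame(
  lean   = runif(n, 0, 100),
  gender = factor(rep(c("F", "M"), length.out = n)),
  age    = sample(18:80, n, replace = TRUE)
)
mydata$score <- 70 + 0.6 * mydata$lean + rnorm(n, sd = 9)

bigmodel <- lm(score ~ lean + gender + age, data = mydata)

# New observations: same predictor names as the formula, no score column needed.
newmodel <- data.frame(
  lean   = c(20, 50, 80),
  gender = c("F", "M", "F"),
  age    = c(25, 40, 65)
)

# Predictors the model expects, and any of them that newmodel lacks.
needed       <- all.vars(delete.response(terms(bigmodel)))
missing_cols <- setdiff(needed, names(newmodel))

preds <- predict(bigmodel, newdata = newmodel)
length(preds)  # one prediction per row of newmodel
```

If `missing_cols` is non-empty, the new data frame is lacking a predictor that the formula uses, which is worth checking before calling `predict`.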

UPDATE. To export the new data together with its predictions, bind the predictions on as a column and write the result to a CSV file:

predictions <- cbind(newmodel, pred = predict(bigmodel, newdata = newmodel))
write.csv(predictions, 'predictions.csv', row.names = FALSE)
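
For a CSV-to-CSV round trip, the following sketch may help. It assumes the new data arrive as a CSV file and that blank cells should be read as missing values, which is what `na.strings = c("", "NA")` asks `read.csv` to do. The training data are simulated, and the temporary file paths `csv_in` and `csv_out` are placeholders for this example rather than anything from the question.

```r
# Simulated training data standing in for the question's mydata.
set.seed(42)
n <- 50
mydata <- data.frame(
  lean   = runif(n, 0, 100),
  gender = factor(rep(c("F", "M"), length.out = n)),
  age    = sample(18:80, n, replace = TRUE)
)
mydata$score <- 70 + 0.6 * mydata$lean + rnorm(n, sd = 9)
bigmodel <- lm(score ~ lean + gender + age, data = mydata)

# A small input file with one blank gender cell (hypothetical contents).
csv_in <- tempfile(fileext = ".csv")
writeLines(c("lean,gender,age", "40,F,30", "55,,41", "70,M,52"), csv_in)

# Read blanks as NA so they are not treated as a new factor level.
newmodel <- read.csv(csv_in, na.strings = c("", "NA"))

# Append predictions as the last column; rows with NA predictors get NA.
newmodel$pred <- predict(bigmodel, newdata = newmodel)

# Write the data plus predictions back out.
csv_out <- tempfile(fileext = ".csv")
write.csv(newmodel, csv_out, row.names = FALSE)
```

The output file opens directly in Excel, with `pred` as its final column.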

UPDATE 2. A minimal, fully reproducible example

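set.seed(1)  # make the random weights reproducible
bigmodel <- lm(mpg ~ wt, data = mtcars)
newdata <- data.frame(wt = runif(20, min = 1.5, max = 6))

# 20 new rows, scored by a model fit on the 32 rows of mtcars
cbind(
  newdata,
  mpg = predict(bigmodel, newdata = newdata)
)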

edited Apr 18 '14 at 00:39

answered Apr 17 '14 at 19:24


Ah! I was missing something. Thanks! I saw the `newdata` but thought it referred to an actual dataframe, rather than an argument. But it's actually both -- right? So I'd have to upload what I initially called `newmodel` as `newdata` and then it works. – Ryan Apr 17 '14 at 20:42
Also, is there a way to more easily export the prediction into a column in Excel? Or even better a way to export the prediction into the last column in the `newdata` .csv? – Ryan Apr 17 '14 at 20:44
To your first comment, yes, you will have to read your new data into the data frame `newmodel` (which is what I am assuming you are naming it). – Ramnath Apr 17 '14 at 20:50
Thanks, but I think I'm still missing one more thing. To create my `bigmodel` object, I used the `mydata` dataset, which is 1707 rows, and my `newmodel` dataset is only 135. If I run my predict function as `predict(bigmodel, newdata = newmodel, type = "p")`, since I'm using `bigmodel` as the object, it's still trying to fit it to 1707 rows, giving me the error `'newdata' had 135 rows but variables found have 1707 rows`. What am I doing wrong here? – Ryan Apr 17 '14 at 21:48
It should not complain about rows, but it will complain about columns, since you should have all the independent variables in `newmodel`. Can you post some dummy data so we can reproduce your error? – Ramnath Apr 18 '14 at 00:36
I have posted a fully reproducible solution using dummy data. Note that `newdata` and `mtcars` have different numbers of rows. – Ramnath Apr 18 '14 at 00:40
Thanks for all the help! I think it has to do with certain blank values being imported to R as blanks instead of `NA`s -- which is something I'm trying to work out here if you can help: [http://stackoverflow.com/questions/23145430/how-can-i-make-sure-all-my-csv-data-gets-imported-as-na-instead-of-blank-in-r](http://stackoverflow.com/questions/23145430/how-can-i-make-sure-all-my-csv-data-gets-imported-as-na-instead-of-blank-in-r) After that gets fixed, I can check to see if this works. – Ryan Apr 19 '14 at 17:58
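
The row-count message quoted in the comments can be reproduced in at least one way: when the formula refers to columns with `$` (for example `mtcars$wt`) instead of bare names plus `data =`, `predict` cannot substitute the columns of `newdata` and falls back on the original rows. The sketch below shows that one cause using `mtcars`; it is not a diagnosis of the asker's data, and the names `new_cars`, `bad_fit` and `good_fit` are made up for this example.

```r
new_cars <- data.frame(wt = c(2, 3, 4))

# Columns hard-coded with `$`: newdata cannot replace them.
bad_fit  <- lm(mtcars$mpg ~ mtcars$wt)
bad_pred <- withCallingHandlers(
  predict(bad_fit, newdata = new_cars),
  warning = function(w) {
    # Report the warning text, then continue without re-raising it.
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
length(bad_pred)   # 32, the rows of mtcars, not the 3 rows of new_cars

# Bare column names plus data = : newdata is used as intended.
good_fit  <- lm(mpg ~ wt, data = mtcars)
good_pred <- predict(good_fit, newdata = new_cars)
length(good_pred)  # 3
```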
