6

I performed a regression analyses in R on some dataset and try to predict the contribution of each individual independent variable on the dependent variable for each row in the dataset.

So something like this:

set.seed(123)                                              
y <- rnorm(10)                                           
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)  
summary(regr)
terms <- predict.lm(regr,m, type="terms")

In short: run a regression and use the predict function to calculate the terms of v1,v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it multiplies the coefficient of the regression result with the variable data. So something like this for v1:

coefficients(regr)[2]*m$v1

But that gives different results compared to the predict function.

Own calculation:

0.55293884  0.16253411  0.18103537  0.04999729 -0.25108302  0.80717945  0.22488764 -0.88835486  0.31681455 -0.21356803

And predict function calculation:

0.45870070  0.06829597  0.08679724 -0.04424084 -0.34532115  0.71294132  0.13064950 -0.98259299  0.22257641 -0.30780616

The prediciton function is of by 0.1 or so Also if you add all terms in the prediction function together with the constant it doesn’t add up to the total prediction (using type=”response”). What does the prediction function calculate here and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?

  • 2
    With `type = 'response"`, it should be `(model.matrix(~v1 + v2 + v3, m) %*% coefficients(regr))[,1]` giving the same as `predict.lm(regr, m, type = "response")` From the `terms`, you can do `rowSums(terms) + attr(terms, "constant")` – akrun Dec 17 '17 at 10:11
  • Does this answer your question? [Prediction Components in R Linear Regression](https://stackoverflow.com/questions/32385992/prediction-components-in-r-linear-regression) – William Chiu May 28 '22 at 20:16

2 Answers2

7

All the following lines result in the same predictions:

# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
  coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3

# prediction using predict function
predict.lm(regr,m)

# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')

You can read more about using type="terms" here.

The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:

# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)

# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)

Hope this helps.

Florian
  • 24,425
  • 4
  • 49
  • 80
  • Thanks! Didn't realized the constant also was different, so that's why the individual terms didn't add up to the total prediction (if you use the constant from the coefficients and not the terms). Still a bit incovenient the terms are centred around the mean (anyone know of a package that does differently? I cannot find it) but I can work woith this:) – Tall Measure Dec 19 '17 at 09:07
0

The function predict(...,type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied to its coefficient.

TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))


library(tidyverse)

### simulate data

set.seed(123)

nobs <- 50

x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)

y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)

df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)

train <- 1:round(0.7*nobs,0)

rm(x1, x2, y)

trainData <- df[train,]
testData <- df[-train,]

### linear model

mod <- lm(y ~ x1 + x2 , data=trainData)

summary(mod)


### predict test set

test_preds <- predict(mod, newdata=testData)

head(test_preds)

### contribution by predictor

test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))

colnames(test_contribution) <- names(coef(mod))

head(test_contribution)

all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5))  ## should be true

### Visualize each contribution

test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]

test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")

names(test_contribution_df_long)

ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
  geom_line() +
  theme_bw()
William Chiu
  • 388
  • 3
  • 19