
I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been fit. For example, if two variables, var_1 and var_2, are known to be correlated, you first fit a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.

I'm interested in the following:

  1. If I'm using a glm, how do I account for the link function used?
  2. What distribution do I choose when running a second glm with var_2 as the explanatory variable? I assume this is related to 1.
  3. Is this at all related to using the first model's prediction as an offset in the second model?

My attempt:

library(data.table)

dt <- data.table(mtcars) # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt) # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]

# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt) ]

# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt) ]

My first suggested approach has convergence issues, but this is how my silly brain would approach the problem. Thanks for any help!

Helen
  • I'm wondering whether this question is more likely to find an answer on [Cross Validated](https://stats.stackexchange.com/), assuming that you write the post with less focus on the code and more on the validity of the approach. – slamballais May 29 '21 at 12:53
  • I don't have an answer, but a similar question was asked [here](https://stats.stackexchange.com/questions/47659/residualize-binary-outcome-variable). One commenter (on the accepted answer) adds a note on different types of residuals, which is further covered [here](https://stats.stackexchange.com/questions/1432/what-do-the-residuals-in-a-logistic-regression-mean). I found the answer by Maverick Meerkat particularly useful. – slamballais May 29 '21 at 13:02
  • @slamballais yes, maybe you are right, that would be a good idea. I feel like I've invested so many points, though, I'm committed :D – Helen May 30 '21 at 10:07
  • perhaps relevant https://stats.stackexchange.com/questions/368369/regressing-logistic-regression-residuals-on-other-regressors, https://besjournals.onlinelibrary.wiley.com/doi/full/10.1046/j.1365-2656.2002.00618.x, https://stats.stackexchange.com/questions/244870/how-to-partial-out-a-covariate-in-logistic-regression – user20650 Jun 03 '21 at 12:20
  • @user20650 thank you for the tips, much appreciated! I think I have to make a Cross Validated post :) – Helen Jun 03 '21 at 12:54

1 Answer


In a sense, an ANCOVA is 'modeling on the residuals'. The ANCOVA model is y = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment group i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with the covariate as the dependent variable and the treatment as the independent variable.

The following regression is equivalent to this ANCOVA:

lm(y ~ treatment * scale(covariate, scale = FALSE))

Which applied to the data would look like this:

lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)

And can be turned into a glm similar to the one you use in your example:

glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), 
    family=Gamma(link="log"), 
    data = mtcars)
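To make the residuals connection concrete, here is a short sketch (base R only; the variable names are my own): regressing the covariate on the treatment factor and taking the residuals gives exactly the covariate centered on its per-group means, which can then enter the second-stage model.

```r
# Residualize the covariate: regress wt on the treatment factor.
# With a single factor predictor, the residuals are wt minus the
# group mean of wt within each level of cyl.
cyl_f  <- factor(mtcars$cyl)
wt_res <- residuals(lm(wt ~ cyl_f, data = mtcars))

# The same quantity computed directly as group-wise centering:
wt_centered <- mtcars$wt - ave(mtcars$wt, cyl_f)

# The two agree up to floating point:
all.equal(unname(wt_res), wt_centered)

# The residualized covariate can then be used in the second-stage model:
fit <- lm(mpg ~ cyl_f * wt_res, data = mtcars)
```

This is the sense in which the centered covariate in the ANCOVA formulation is "the residuals of a model with the covariate as DV and treatment as IV".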
Tim