
I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been fit. For example, if two variables, var_1 and var_2, are known to be correlated, you first fit a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.

I'm interested in the following:

  1. If I'm using a glm, how do I account for the link function used?
  2. What distribution do I choose when running a second glm with var_2 as the explanatory variable? I assume this is related to 1.
  3. Is this at all related to using the first model's prediction as an offset in the second model?

My attempt:

library(data.table)

dt <- data.table(mtcars) # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt) # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]

# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt) ]

# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt) ]

My first suggested approach has convergence issues, but this is how my silly brain would approach the problem. Thanks for any help!

Helen
  • I'm wondering whether this question is more likely to find an answer on [Cross Validated](https://stats.stackexchange.com/), assuming that you write the post with less focus on the code and more on the validity of the approach. – slamballais May 29 '21 at 12:53
  • I don't have an answer, but a similar question was asked [here](https://stats.stackexchange.com/questions/47659/residualize-binary-outcome-variable). One commenter (on the accepted answer) adds a note on different types of residuals, which is further covered [here](https://stats.stackexchange.com/questions/1432/what-do-the-residuals-in-a-logistic-regression-mean). I found the answer by Maverick Meerkat particularly useful. – slamballais May 29 '21 at 13:02
  • @slamballais yes, maybe you are right, that would be a good idea. I feel like I've invested so many points, though, I'm committed :D – Helen May 30 '21 at 10:07
  • perhaps relevant https://stats.stackexchange.com/questions/368369/regressing-logistic-regression-residuals-on-other-regressors, https://besjournals.onlinelibrary.wiley.com/doi/full/10.1046/j.1365-2656.2002.00618.x, https://stats.stackexchange.com/questions/244870/how-to-partial-out-a-covariate-in-logistic-regression – user20650 Jun 03 '21 at 12:20
  • @user20650 thank you for the tips, much appreciated! I think I have to make a Cross Validated post :) – Helen Jun 03 '21 at 12:54

1 Answer


In a sense, an ANCOVA is 'modeling on the residuals'. The ANCOVA model is y = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment group i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with the covariate as the dependent variable and the treatment as the independent variable.

The following regression is equivalent to this ANCOVA:

lm(y ~ treatment * scale(covariate, scale = FALSE))

Which applied to the data would look like this:

lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)

And can be turned into a glm similar to the one you use in your example:

glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), 
    family=Gamma(link="log"), 
    data = mtcars)
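To make the residuals connection concrete, here is a short sketch (base R only; the variable names are my own): regressing the covariate on the treatment factor and taking the residuals gives exactly the covariate centered on its per-group means, which can then enter the second-stage model.

```r
# Residualize the covariate: regress wt on the treatment factor.
# With a single factor predictor, the residuals are wt minus the
# group mean of wt within each level of cyl.
cyl_f  <- factor(mtcars$cyl)
wt_res <- residuals(lm(wt ~ cyl_f, data = mtcars))

# The same quantity computed directly as group-wise centering:
wt_centered <- mtcars$wt - ave(mtcars$wt, cyl_f)

# The two agree up to floating point:
all.equal(unname(wt_res), wt_centered)

# The residualized covariate can then be used in the second-stage model:
fit <- lm(mpg ~ cyl_f * wt_res, data = mtcars)
```

This is the sense in which the centered covariate in the ANCOVA formulation is "the residuals of a model with the covariate as DV and treatment as IV".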
Tim