
I have fitted a logistic regression for a binary outcome (a type of side effect: whether or not a patient has it). The model formula and results are below:

model  <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -7.888112   0.859847  -9.174  < 2e-16 ***
age                    0.028529   0.009212   3.097  0.00196 ** 
bmi                    0.095759   0.015265   6.273 3.53e-10 ***
surgery11              0.923723   0.524588   1.761  0.07826 .  
surgery21              1.607389   0.600113   2.678  0.00740 ** 
surgery31              1.544822   0.573972   2.691  0.00711 ** 
cvd1                   0.624692   0.290005   2.154  0.03123 *  
rt1                    -0.816374   0.353953  -2.306  0.02109 *  

I want to check my model, so I have plotted the residuals against the predictors and the fitted values. As I understand it, if a model is properly fitted there should be no correlation between the residuals and either the predictors or the fitted values, so I ran:

residualPlots(model)

My plots look odd: from the examples I have seen online, the residuals should be symmetrical around 0. Also, my factor variables aren't shown as box plots, although I have checked the structure of my data and coded surgery1, surgery2, surgery3, cvd and rt as factors. Can someone help me interpret my plots and show me how to draw box plots for my factor variables?

[Image: plots produced by residualPlots(model)]

Thanks

HKJ3

1 Answer


What you are seeing is expected when the response variable is imbalanced. In your plots most of the residuals fall below the dotted line, so I suspect this is the case.

Very briefly, symmetry of the residuals around 0 only holds for logistic regression when your classes are balanced. If the data are heavily imbalanced towards the reference label (the 0 label), the intercept is pushed towards a low value, so the fitted probabilities are small, and the positive labels end up with very large Pearson residuals (because they deviate a lot from the expected value). You can read more about imbalanced classes and logistic regression in this post
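To make the arithmetic concrete, here is a minimal sketch using an intercept-only model on simulated data. The 900/100 split and the object names are my own assumptions for illustration, not taken from the question.

```r
# Simulated imbalanced outcome (assumed split): 900 zeros and 100 ones.
y <- c(rep(0, 900), rep(1, 100))

# Intercept-only logistic regression: the fitted probability is the share of ones.
fit <- glm(y ~ 1, family = binomial)
p_hat <- unname(fitted(fit)[1])          # about 0.1

# Pearson residual: (y - p_hat) / sqrt(p_hat * (1 - p_hat))
r <- residuals(fit, type = "pearson")

# One residual value per class: about -0.33 for the 0s and about 3 for the 1s.
tapply(r, y, function(v) round(v[1], 3))
```

The many 0s sit just below zero while the few 1s sit far above it, which is the asymmetric pattern described above.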

Here's an example to demonstrate this, starting with a dataset where the residuals are fairly evenly distributed:

library(mlbench)
library(car)
data(PimaIndiansDiabetes)

table(PimaIndiansDiabetes$diabetes)
# neg pos 
# 500 268

mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)

[Image: residualPlots(mdl) output for the full PimaIndiansDiabetes data]

Let's make it more imbalanced, and you get a plot exactly like yours:

da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)

# neg pos 
# 500 100 

mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)

[Image: residualPlots(mdl) output for the imbalanced subset]

StupidWolf
  • Thank you for your response. Does it matter if the response variable has far more controls than cases? Is the model still okay? I also want to know if these plots look okay, i.e. there should be no correlation between the residuals and the predictors or fitted values. I guess for my numeric variables age and bmi it looks okay. How do I draw a boxplot for the binary variables, e.g. surgery axilla, cvd and rt_axilla? I have coded them as factor variables, but for some reason they aren't presented as boxplots. – HKJ3 Jan 05 '22 at 16:42
  • It should be ok. You can briefly check that not all of your predictions are the control class, see https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression . One option is to weight the cases more strongly – StupidWolf Jan 05 '22 at 17:22
  • There's not much point in plotting residuals for binary variables. I think your model is ok. Your point about correlation between residuals and predictors only holds for linear regression; it might not be the case for logistic regression, where the predicted value is a log odds and the actual value is binary – StupidWolf Jan 05 '22 at 17:29
  • See https://stats.stackexchange.com/questions/234998/logistic-regression-diagnostic-plots-in-r – StupidWolf Jan 05 '22 at 17:29
  • Thanks for your reply. I am a bit confused, because this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885900/ describes checking residuals vs predictors for a glm model. I thought this was necessary for both linear and logistic models? – HKJ3 Jan 05 '22 at 17:36
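For the two practical points raised in the comments (a boxplot of residuals for a factor predictor, and checking that the predictions are not all the control class), one possible sketch in base R is below. It uses simulated data as a stand-in for data1; the column names, sample size and effect sizes are assumptions for illustration only, and it draws the boxplot directly rather than through residualPlots.

```r
set.seed(1)
n <- 500

# Simulated stand-in for data1 (hypothetical values, not the asker's data).
d <- data.frame(
  bmi = rnorm(n, mean = 27, sd = 4),
  cvd = factor(rbinom(n, 1, 0.3))
)
d$side_effect <- rbinom(n, 1, plogis(-4 + 0.1 * d$bmi + 0.6 * (d$cvd == "1")))

fit <- glm(side_effect ~ bmi + cvd, family = binomial, data = d)

# Pearson residuals grouped by the levels of the factor predictor.
res <- residuals(fit, type = "pearson")
boxplot(res ~ d$cvd, xlab = "cvd", ylab = "Pearson residuals")
abline(h = 0, lty = 2)

# Cross-tabulate predicted class (0.5 cutoff, an arbitrary choice) against the outcome
# to see whether every observation is predicted as the control class.
pred <- as.integer(fitted(fit) > 0.5)
table(predicted = pred, observed = d$side_effect)
```

With a rare outcome the boxes will typically sit below zero with the cases appearing as high outliers, in line with the imbalance explanation in the answer.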