I have a binary outcome variable and 4 predictors: two binary and two continuous (truncated to whole numbers). I have 1158 observations, and the objective of the analysis is to predict the probability of the binary outcome (infection), then check the goodness of fit and predictive quality of the final model.
> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1158 obs. of 5 variables:
$ age : num 25 49 41 19 55 37 30 31 52 37 ...
$ gender: num 1 1 1 0 0 0 1 0 1 1 ...
$ var1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ y : num 1 0 0 1 1 0 1 1 0 1 ...
$ var2 : num 26 33 25 30 28 20 28 21 17 25 ...
I have seen that the data is sometimes split in two (a training and a testing set), but not always. I assume this depends on the original sample size? Is it advisable to split the data for my analysis?
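For concreteness, if splitting turns out to be advisable, I would do it roughly like this (a base R sketch; the 70/30 ratio and the seed are arbitrary assumptions, not recommendations):

```r
# Hypothetical 70/30 train/test split using base R only
set.seed(123)                                   # for reproducibility
n <- nrow(data)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- data[train_idx, ]                      # fit models here
test  <- data[-train_idx, ]                     # evaluate predictions here
```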
For now, I have not split the data. I conducted various variable selection procedures:
- manual LRT-based backward selection
- automated LRT-based forward selection
- automated LRT-based backward selection
- AIC-based backward selection
- AIC-based forward selection
And they all lead to the same result: only age and gender should be included in my model.
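For example, one step of the manual LRT-based backward selection looked roughly like this (a sketch; `drop1()` with `test = "LRT"` drops each term in turn and reports the likelihood-ratio test against the full fit):

```r
# Full main-effects model with all 4 predictors
full <- glm(y ~ age + gender + var1 + var2, family = binomial, data = data)
drop1(full, test = "LRT")   # LRT p-value for removing each term

# Remove the term with the largest non-significant p-value, refit, repeat
# until every remaining term is significant.
```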
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2716 -0.8767 -0.7361 1.3008 1.9353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.785753 0.238634 3.293 0.000992 ***
age -0.031504 0.004882 -6.453 1.1e-10 ***
gender -0.223195 0.129774 -1.720 0.085455 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1444.9 on 1157 degrees of freedom
Residual deviance: 1398.7 on 1155 degrees of freedom
AIC: 1404.7
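Since the stated objective is goodness of fit and predictive quality, I plan to check the final model along these lines (a sketch; it assumes the ResourceSelection and pROC packages, which provide the Hosmer-Lemeshow test and ROC/AUC respectively):

```r
library(ResourceSelection)  # hoslem.test()
library(pROC)               # roc(), auc()

final <- glm(y ~ age + gender, family = binomial, data = data)
p_hat <- predict(final, type = "response")   # fitted probabilities

hoslem.test(data$y, p_hat, g = 10)  # Hosmer-Lemeshow goodness-of-fit test
auc(roc(data$y, p_hat))             # discrimination: area under the ROC curve
```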
Now, I want to see whether any interactions or polynomial terms are significant. The dot (.) denotes the full model with all 4 predictors.
full.twoway <- glm(y ~ (.)^2 , family = binomial, data=data) # includes 2-way interactions
summary(full.twoway)
model.aic.backward_2w <- step(full.twoway, direction = "backward", trace = 1)
summary(model.aic.backward_2w)
full.threeway <- glm(y ~ (.)^3 , family = binomial, data=data) # includes up to 3-way interactions
summary(full.threeway)
# significant three-way interaction: age:gender:var1 at 0.05
model.aic.backward_3w <- step(full.threeway, direction = "backward", trace = 1)
summary(model.aic.backward_3w)
# polynomials
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data=data)
# only age, gender significant
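To test the quadratic term formally rather than reading it off the Wald z-tests, the two nested models can be compared with a likelihood-ratio test (a sketch; `anova()` with `test = "Chisq"` compares the deviances of the two fits):

```r
model.base       <- glm(y ~ age + gender, family = binomial, data = data)
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data = data)

anova(model.base, model.polynomial, test = "Chisq")  # LRT for the I(age^2) term
```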
Here too, only age and gender are significant. This seems very strange to me: I would have expected some interaction or polynomial term to be significant. Am I doing something wrong? Are there other variable selection techniques?
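One alternative I am considering is penalized (LASSO) selection, which shrinks uninformative coefficients to exactly zero instead of testing terms one by one (a sketch; it assumes the glmnet package, with the interaction design matrix built via `model.matrix()`):

```r
library(glmnet)

# Design matrix: main effects plus all two-way interactions (intercept dropped)
x  <- model.matrix(y ~ (age + gender + var1 + var2)^2, data = data)[, -1]
cv <- cv.glmnet(x, data$y, family = "binomial")  # lambda chosen by cross-validation

coef(cv, s = "lambda.1se")  # terms with non-zero coefficients are retained
```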
EDIT:
I have partitioned the dataset into training and testing sets. The training dataset consists of 868 observations. The results of the selection procedures now indicate that only the variable age
is significant...