I have a binary outcome variable and 4 predictors: two binary and two continuous (truncated to whole numbers). I have 1158 observations, and the objective of the analysis is to predict the probability of the binary outcome (infection), then check the goodness of fit and predictive quality of the final model.
> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1158 obs. of 5 variables:
$ age : num 25 49 41 19 55 37 30 31 52 37 ...
$ gender: num 1 1 1 0 0 0 1 0 1 1 ...
$ var1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ y : num 1 0 0 1 1 0 1 1 0 1 ...
$ var2 : num 26 33 25 30 28 20 28 21 17 25 ...
I have seen that the data is sometimes split in two (a training and a testing set), but not always. I assume this depends on the original sample size? Is it advisable to split the data for my analysis?
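For concreteness, if splitting turns out to be advisable, I would do it roughly like this (a base R sketch; the 70/30 ratio and the seed are arbitrary assumptions, not recommendations):

```r
# Hypothetical 70/30 train/test split using base R only
set.seed(123)                                   # for reproducibility
n <- nrow(data)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- data[train_idx, ]                      # fit models here
test  <- data[-train_idx, ]                     # evaluate predictions here
```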
For now, I have not split the data. I conducted various variable selection procedures:
- manual LRT-based backward selection
- automated LRT-based forward selection
- automated LRT-based backward selection
- AIC-based backward selection
- AIC-based forward selection
And they all lead to the same result: only age and gender should be included in my model.
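For example, one step of the manual LRT-based backward selection looked roughly like this (a sketch; `drop1()` with `test = "LRT"` drops each term in turn and reports the likelihood-ratio test against the full fit):

```r
# Full main-effects model with all 4 predictors
full <- glm(y ~ age + gender + var1 + var2, family = binomial, data = data)
drop1(full, test = "LRT")   # LRT p-value for removing each term

# Remove the term with the largest non-significant p-value, refit, repeat
# until every remaining term is significant.
```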
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2716 -0.8767 -0.7361 1.3008 1.9353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.785753 0.238634 3.293 0.000992 ***
age -0.031504 0.004882 -6.453 1.1e-10 ***
gender -0.223195 0.129774 -1.720 0.085455 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1444.9 on 1157 degrees of freedom
Residual deviance: 1398.7 on 1155 degrees of freedom
AIC: 1404.7
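Since the stated objective is goodness of fit and predictive quality, I plan to check the final model along these lines (a sketch; it assumes the ResourceSelection and pROC packages, which provide the Hosmer-Lemeshow test and ROC/AUC respectively):

```r
library(ResourceSelection)  # hoslem.test()
library(pROC)               # roc(), auc()

final <- glm(y ~ age + gender, family = binomial, data = data)
p_hat <- predict(final, type = "response")   # fitted probabilities

hoslem.test(data$y, p_hat, g = 10)  # Hosmer-Lemeshow goodness-of-fit test
auc(roc(data$y, p_hat))             # discrimination: area under the ROC curve
```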
Now, I want to see whether any interactions or polynomial terms are significant. The dot (.) denotes the full model with all 4 predictors.
full.twoway <- glm(y ~ (.)^2 , family = binomial, data=data) # includes 2-way interactions
summary(full.twoway)
model.aic.backward_2w <- step(full.twoway, direction = "backward", trace = 1)
summary(model.aic.backward_2w)
full.threeway <- glm(y ~ (.)^3 , family = binomial, data=data) # includes up to 3-way interactions
summary(full.threeway)
# significant three-way interaction: age:gender:var1 at 0.05
model.aic.backward_3w <- step(full.threeway, direction = "backward", trace = 1)
summary(model.aic.backward_3w)
# polynomials
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data=data)
# only age, gender significant
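To test the quadratic term formally rather than reading it off the Wald z-tests, the two nested models can be compared with a likelihood-ratio test (a sketch; `anova()` with `test = "Chisq"` compares the deviances of the two fits):

```r
model.base       <- glm(y ~ age + gender, family = binomial, data = data)
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data = data)

anova(model.base, model.polynomial, test = "Chisq")  # LRT for the I(age^2) term
```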
Here too, only age and gender are significant. This seems very strange to me: I would have expected some interaction or polynomial term to be significant. Am I doing something wrong? Are there other variable selection techniques?
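One alternative I am considering is penalized (LASSO) selection, which shrinks uninformative coefficients to exactly zero instead of testing terms one by one (a sketch; it assumes the glmnet package, with the interaction design matrix built via `model.matrix()`):

```r
library(glmnet)

# Design matrix: main effects plus all two-way interactions (intercept dropped)
x  <- model.matrix(y ~ (age + gender + var1 + var2)^2, data = data)[, -1]
cv <- cv.glmnet(x, data$y, family = "binomial")  # lambda chosen by cross-validation

coef(cv, s = "lambda.1se")  # terms with non-zero coefficients are retained
```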
EDIT:
I have partitioned the dataset into training and testing sets. The training dataset consists of 868 observations. The results of the selection procedures now indicate that only the variable age
is significant...