Up front: this is a simple (arguably naive) way to reduce a model step by step. There are most certainly better methods out there, most of which are taught in statistics classes (advanced, or at least "robust", ones).
But in the interim, try this.
For the sake of getting something started, I "never" allow the intercept to be discarded; that is a deliberate choice and usually a safe bet, though there are cases where one might consider removing it. By the time you reach that point, I suspect you'll have more resources in your toolkit for weighing that decision. (So for now, the intercept always stays.)
fun <- function(data, frm, threshold = 0.05, verbose = FALSE) {
  # default formula: first column is the response, everything else is a predictor
  if (missing(frm)) frm <- reformulate(names(data)[-1], response = names(data)[1])
  while (TRUE) {
    if (verbose) print(frm)
    mdl <- lm(frm, data = data)
    coef <- summary(mdl)$coefficients
    if (verbose) print(coef)
    # p-values are the last column; the intercept is excluded so it can never be dropped
    coef <- coef[rownames(coef) != "(Intercept)", ncol(coef)]
    # candidate to drop: the single highest p-value, kept only if it exceeds the threshold
    drop <- which.max(coef)
    drop <- drop[coef[drop] > threshold]
    if (length(drop)) {
      if (verbose) message(paste("## drop:", names(drop), "=", round(coef[drop], 3)))
      # remove that term from the formula and refit on the next pass
      frm <- drop.terms(terms(mdl), drop, keep.response = TRUE)
      attributes(frm) <- NULL # only to keep verbose printing clean
    } else break
  }
  list(formula = frm, model = mdl)
}
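In case drop.terms() (from stats) is unfamiliar: it is what removes one term from the formula each pass. A small standalone illustration, on a made-up formula, of dropping a right-hand-side term by position:

tt <- terms(mpg ~ cyl + disp + hp)
drop.terms(tt, 2, keep.response = TRUE) # drops "disp" (position 2), keeps the response
# prints as mpg ~ cyl + hp, followed by the usual terms-object attributes
# (which is why the function above clears attributes, for cleaner verbose printing)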
Demonstration on mtcars:

out <- fun(mtcars, mpg ~ .)
out$formula
# mpg ~ wt + qsec + am
summary(out$model)
# Call:
# lm(formula = frm, data = data)
# Residuals:
# Min 1Q Median 3Q Max
# -3.4811 -1.5555 -0.7257 1.4110 4.6610
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 9.6178 6.9596 1.382 0.177915
# wt -3.9165 0.7112 -5.507 6.95e-06 ***
# qsec 1.2259 0.2887 4.247 0.000216 ***
# am 2.9358 1.4109 2.081 0.046716 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 2.459 on 28 degrees of freedom
# Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
# F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
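The threshold argument controls how aggressive the pruning is (the default 0.05 was used above). Judging from the verbose trace below, every dropped term had a p-value above 0.10, so a looser cutoff of 0.10 should land on the same model here, while a stricter one such as 0.01 would also drop am (p ≈ 0.047) and keep going. For example (object names are just my own):

out_strict <- fun(mtcars, mpg ~ ., threshold = 0.01) # prunes more aggressively
out_loose  <- fun(mtcars, mpg ~ ., threshold = 0.10) # looser cutoff
out_strict$formula
out_loose$formula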
If you want to see what it does at each step, run it with verbose = TRUE:
out <- fun(mtcars, mpg ~ ., verbose = TRUE)
# mpg ~ .
# <environment: 0x0000021eb97c5f00>
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
# cyl -0.11144048 1.04502336 -0.1066392 0.91608738
# disp 0.01333524 0.01785750 0.7467585 0.46348865
# hp -0.02148212 0.02176858 -0.9868407 0.33495531
# drat 0.78711097 1.63537307 0.4813036 0.63527790
# wt -3.71530393 1.89441430 -1.9611887 0.06325215
# qsec 0.82104075 0.73084480 1.1234133 0.27394127
# vs 0.31776281 2.10450861 0.1509915 0.88142347
# am 2.52022689 2.05665055 1.2254035 0.23398971
# gear 0.65541302 1.49325996 0.4389142 0.66520643
# carb -0.19941925 0.82875250 -0.2406258 0.81217871
# ## drop: cyl = 0.916
# mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 10.96007405 13.53030251 0.8100391 0.42659327
# disp 0.01282839 0.01682215 0.7625891 0.45380797
# hp -0.02190885 0.02091131 -1.0477031 0.30615002
# drat 0.83519652 1.53625251 0.5436584 0.59214373
# wt -3.69250814 1.83953550 -2.0073046 0.05715727
# qsec 0.84244138 0.68678068 1.2266527 0.23291993
# vs 0.38974986 1.94800204 0.2000767 0.84325850
# am 2.57742789 1.94034563 1.3283344 0.19768373
# gear 0.71155439 1.36561933 0.5210489 0.60753821
# carb -0.21958316 0.78855537 -0.2784626 0.78325783
# ## drop: vs = 0.843
# mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 9.76827789 11.89230469 0.8213949 0.41985460
# disp 0.01214441 0.01612373 0.7532010 0.45897019
# hp -0.02095020 0.01992567 -1.0514175 0.30398892
# drat 0.87509822 1.49112525 0.5868710 0.56300717
# wt -3.71151106 1.79833544 -2.0638592 0.05049085
# qsec 0.91082822 0.58311935 1.5619928 0.13194532
# am 2.52390094 1.88128007 1.3415870 0.19282690
# gear 0.75984464 1.31577205 0.5774896 0.56921947
# carb -0.24796312 0.75933250 -0.3265541 0.74695821
# ## drop: carb = 0.747
# mpg ~ disp + hp + drat + wt + qsec + am + gear
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 9.19762837 11.54220381 0.7968693 0.433339841
# disp 0.01551976 0.01214235 1.2781513 0.213420001
# hp -0.02470716 0.01596302 -1.5477746 0.134763097
# drat 0.81022794 1.45006779 0.5587518 0.581507634
# wt -4.13065054 1.23592980 -3.3421401 0.002717119
# qsec 1.00978651 0.48883274 2.0657097 0.049814778
# am 2.58979984 1.83528342 1.4111171 0.171042438
# gear 0.60644020 1.20596266 0.5028681 0.619640616
# ## drop: gear = 0.62
# mpg ~ disp + hp + drat + wt + qsec + am
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 10.71061639 10.97539399 0.9758753 0.338475309
# disp 0.01310313 0.01098299 1.1930387 0.244054196
# hp -0.02179818 0.01465399 -1.4875257 0.149381426
# drat 1.02065283 1.36747598 0.7463772 0.462401185
# wt -4.04454214 1.20558182 -3.3548467 0.002536163
# qsec 0.99072948 0.48002393 2.0639168 0.049550895
# am 2.98468801 1.63382423 1.8268110 0.079692318
# ## drop: drat = 0.462
# mpg ~ disp + hp + wt + qsec + am
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 14.36190396 9.74079485 1.474408 0.152378367
# disp 0.01123765 0.01060333 1.059823 0.298972150
# hp -0.02117055 0.01450469 -1.459565 0.156387279
# wt -4.08433206 1.19409972 -3.420428 0.002075008
# qsec 1.00689683 0.47543287 2.117853 0.043907652
# am 3.47045340 1.48578009 2.335779 0.027487809
# ## drop: disp = 0.299
# mpg ~ hp + wt + qsec + am
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 17.44019110 9.3188688 1.871492 0.072149342
# hp -0.01764654 0.0141506 -1.247052 0.223087932
# wt -3.23809682 0.8898986 -3.638726 0.001141407
# qsec 0.81060254 0.4388703 1.847021 0.075731202
# am 2.92550394 1.3971471 2.093913 0.045790788
# ## drop: hp = 0.223
# mpg ~ wt + qsec + am
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
# wt -3.916504 0.7112016 -5.506882 6.952711e-06
# qsec 1.225886 0.2886696 4.246676 2.161737e-04
# am 2.935837 1.4109045 2.080819 4.671551e-02
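For comparison with the "better methods" mentioned up front: base R's step() (or MASS::stepAIC()) performs a similar backward search but scores candidate models by AIC rather than by individual p-values, so it may settle on a different model. A minimal sketch, not a drop-in replacement for the function above:

full <- lm(mpg ~ ., data = mtcars)
step(full, direction = "backward", trace = 0) # AIC-based backward elimination; trace = 0 hides the per-step output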
I should note that it seems counter-intuitive that, in this model, disp (displacement) is not influential in determining fuel efficiency (mpg). I haven't actually studied this dataset in depth, but one should be careful with this "always drop the highest p-value" approach and not accept its results with absolute certainty. Caveat emptor.
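Part of the explanation is likely collinearity: disp and wt are strongly correlated in mtcars, so once wt is in the model, disp adds little independent information and its p-value stays high. A quick check, if you want to convince yourself:

cor(mtcars$disp, mtcars$wt) # strong positive correlation (roughly 0.89)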