I recently ran into the problem that removing all of the insignificant variables from a model by hand takes a lot of time. I tried writing a function, but I would gladly take some advice. Ideally, the function would remove the variables one at a time, always the one with the highest p-value, until all remaining variables are significant at the 5% level.

This is my "function":

# coefficient table of the fitted model, as a data frame
x <- summary(model_test1)
x <- x$coefficients
x <- as.data.frame(x)
# return the name of the coefficient with the largest p-value
max_p <- function(x) {
  nameofmax <- rownames(x)[which.max(x$`Pr(>|t|)`)]
  return(nameofmax)
}
  • Please show a small reproducible example – akrun Jan 05 '23 at 17:51
  • `broom::tidy(model_test1) %>% filter(p.value < 0.05)` – megmac Jan 05 '23 at 17:57
  • I'd suggest looking up ridge regression and LASSO regression. These will be more effective approaches than removing variables one by one. – Jon Spring Jan 05 '23 at 18:08
  • `step(model_test1)` will perform stepwise regression and return the final model based on AIC. How to select variables can be controversial as discussed here https://freakonometrics.hypotheses.org/19925 There are also a set of methods that perform selection and fitting at the same time to eliminate biases from preprocessing. See the abess and glmnet R packages. – G. Grothendieck Jan 05 '23 at 21:06
  • Thanks, everyone, for the fast and useful information! – matehorvath Jan 06 '23 at 22:26

1 Answer

Up front: this is a simple (arguably naive) way to reduce the model step by step. There are most certainly better methods out there, most of which are taught in statistics classes (advanced or at least "robust" classes).

But in the interim, try this.

For the sake of getting something started, I'm never allowing the intercept to be discarded. This is a deliberate choice and often a safe bet, but there might be cases where one could consider removing it. By the time you get to that point, I suspect you will have more resources in your toolkit for analyzing it. (So I'm always keeping it for now.)

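fun <- function(data, frm, threshold = 0.05, verbose = FALSE) {
  # default formula: first column is the response, all other columns are predictors
  if (missing(frm)) frm <- reformulate(names(data)[-1], response = names(data)[1])
  while (TRUE) {
    if (verbose) print(frm)
    mdl <- lm(frm, data = data)
    coef <- summary(mdl)$coefficients
    if (verbose) print(coef)
    # p-values (last column) of everything except the intercept
    coef <- coef[rownames(coef) != "(Intercept)", ncol(coef)]
    # candidate: the largest p-value, kept only if it exceeds the threshold
    drop <- which.max(coef)
    drop <- drop[coef[drop] > threshold]
    if (length(drop)) {
      if (verbose) message(paste("## drop:", names(drop), "=", round(coef[drop], 3)))
      # 'drop' is a coefficient index used as a term index, so this assumes
      # one coefficient per term (e.g. no multi-level factors)
      frm <- drop.terms(terms(mdl), drop, keep.response = TRUE)
      attributes(frm) <- NULL # only to keep verbose printing clean
    } else break
  }
  list(formula = frm, model = mdl)
}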
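
One caveat about the function above: it uses the position of a coefficient as the position of a term, which only lines up when every term contributes exactly one coefficient. As a rough sketch of a way around that (not part of the function above; `fun_terms` is a hypothetical name), base R's `drop1()` reports one F-test p-value per term, so a factor predictor would be kept or dropped as a whole:

```r
# Hypothetical term-wise variant: one p-value per term via drop1()
fun_terms <- function(data, frm, threshold = 0.05) {
  mdl <- lm(frm, data = data)
  repeat {
    # stop if only the intercept is left
    if (!length(attr(terms(mdl), "term.labels"))) break
    tab <- drop1(mdl, test = "F")
    # first row is "<none>" (the current model); the rest are single-term deletions
    pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
    if (max(pvals) <= threshold) break
    # refit without the term that has the largest p-value
    worst <- names(pvals)[which.max(pvals)]
    mdl <- update(mdl, as.formula(paste(". ~ . -", worst)))
  }
  mdl
}

out_terms <- fun_terms(mtcars, mpg ~ .)
formula(out_terms)
```

On mtcars every predictor is numeric, so each F test here is equivalent to the corresponding t test and I would expect this to land on the same final model as `fun()`.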

Demonstration on mtcars:

out <- fun(mtcars, mpg ~ .)
out$formula
# mpg ~ wt + qsec + am
summary(out$model)
# Call:
# lm(formula = frm, data = data)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.4811 -1.5555 -0.7257  1.4110  4.6610 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   9.6178     6.9596   1.382 0.177915    
# wt           -3.9165     0.7112  -5.507 6.95e-06 ***
# qsec          1.2259     0.2887   4.247 0.000216 ***
# am            2.9358     1.4109   2.081 0.046716 *  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 2.459 on 28 degrees of freedom
# Multiple R-squared:  0.8497,  Adjusted R-squared:  0.8336 
# F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

To see what it does at each step, run it with verbose = TRUE.

out <- fun(mtcars, mpg ~ ., verbose = TRUE)
# mpg ~ .
# <environment: 0x0000021eb97c5f00>
#                Estimate  Std. Error    t value   Pr(>|t|)
# (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
# cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
# disp         0.01333524  0.01785750  0.7467585 0.46348865
# hp          -0.02148212  0.02176858 -0.9868407 0.33495531
# drat         0.78711097  1.63537307  0.4813036 0.63527790
# wt          -3.71530393  1.89441430 -1.9611887 0.06325215
# qsec         0.82104075  0.73084480  1.1234133 0.27394127
# vs           0.31776281  2.10450861  0.1509915 0.88142347
# am           2.52022689  2.05665055  1.2254035 0.23398971
# gear         0.65541302  1.49325996  0.4389142 0.66520643
# carb        -0.19941925  0.82875250 -0.2406258 0.81217871
# ## drop: cyl = 0.916
# mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
#                Estimate  Std. Error    t value   Pr(>|t|)
# (Intercept) 10.96007405 13.53030251  0.8100391 0.42659327
# disp         0.01282839  0.01682215  0.7625891 0.45380797
# hp          -0.02190885  0.02091131 -1.0477031 0.30615002
# drat         0.83519652  1.53625251  0.5436584 0.59214373
# wt          -3.69250814  1.83953550 -2.0073046 0.05715727
# qsec         0.84244138  0.68678068  1.2266527 0.23291993
# vs           0.38974986  1.94800204  0.2000767 0.84325850
# am           2.57742789  1.94034563  1.3283344 0.19768373
# gear         0.71155439  1.36561933  0.5210489 0.60753821
# carb        -0.21958316  0.78855537 -0.2784626 0.78325783
# ## drop: vs = 0.843
# mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
#                Estimate  Std. Error    t value   Pr(>|t|)
# (Intercept)  9.76827789 11.89230469  0.8213949 0.41985460
# disp         0.01214441  0.01612373  0.7532010 0.45897019
# hp          -0.02095020  0.01992567 -1.0514175 0.30398892
# drat         0.87509822  1.49112525  0.5868710 0.56300717
# wt          -3.71151106  1.79833544 -2.0638592 0.05049085
# qsec         0.91082822  0.58311935  1.5619928 0.13194532
# am           2.52390094  1.88128007  1.3415870 0.19282690
# gear         0.75984464  1.31577205  0.5774896 0.56921947
# carb        -0.24796312  0.75933250 -0.3265541 0.74695821
# ## drop: carb = 0.747
# mpg ~ disp + hp + drat + wt + qsec + am + gear
#                Estimate  Std. Error    t value    Pr(>|t|)
# (Intercept)  9.19762837 11.54220381  0.7968693 0.433339841
# disp         0.01551976  0.01214235  1.2781513 0.213420001
# hp          -0.02470716  0.01596302 -1.5477746 0.134763097
# drat         0.81022794  1.45006779  0.5587518 0.581507634
# wt          -4.13065054  1.23592980 -3.3421401 0.002717119
# qsec         1.00978651  0.48883274  2.0657097 0.049814778
# am           2.58979984  1.83528342  1.4111171 0.171042438
# gear         0.60644020  1.20596266  0.5028681 0.619640616
# ## drop: gear = 0.62
# mpg ~ disp + hp + drat + wt + qsec + am
#                Estimate  Std. Error    t value    Pr(>|t|)
# (Intercept) 10.71061639 10.97539399  0.9758753 0.338475309
# disp         0.01310313  0.01098299  1.1930387 0.244054196
# hp          -0.02179818  0.01465399 -1.4875257 0.149381426
# drat         1.02065283  1.36747598  0.7463772 0.462401185
# wt          -4.04454214  1.20558182 -3.3548467 0.002536163
# qsec         0.99072948  0.48002393  2.0639168 0.049550895
# am           2.98468801  1.63382423  1.8268110 0.079692318
# ## drop: drat = 0.462
# mpg ~ disp + hp + wt + qsec + am
#                Estimate Std. Error   t value    Pr(>|t|)
# (Intercept) 14.36190396 9.74079485  1.474408 0.152378367
# disp         0.01123765 0.01060333  1.059823 0.298972150
# hp          -0.02117055 0.01450469 -1.459565 0.156387279
# wt          -4.08433206 1.19409972 -3.420428 0.002075008
# qsec         1.00689683 0.47543287  2.117853 0.043907652
# am           3.47045340 1.48578009  2.335779 0.027487809
# ## drop: disp = 0.299
# mpg ~ hp + wt + qsec + am
#                Estimate Std. Error   t value    Pr(>|t|)
# (Intercept) 17.44019110  9.3188688  1.871492 0.072149342
# hp          -0.01764654  0.0141506 -1.247052 0.223087932
# wt          -3.23809682  0.8898986 -3.638726 0.001141407
# qsec         0.81060254  0.4388703  1.847021 0.075731202
# am           2.92550394  1.3971471  2.093913 0.045790788
# ## drop: hp = 0.223
# mpg ~ wt + qsec + am
#              Estimate Std. Error   t value     Pr(>|t|)
# (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
# wt          -3.916504  0.7112016 -5.506882 6.952711e-06
# qsec         1.225886  0.2886696  4.246676 2.161737e-04
# am           2.935837  1.4109045  2.080819 4.671551e-02
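
For comparison, the `step()` function mentioned in the comments runs a similar backward search but scores models by AIC rather than by p-values. A minimal sketch, assuming the same full model as above (`full` and `by_aic` are just illustrative names):

```r
full <- lm(mpg ~ ., data = mtcars)

# backward elimination by AIC; trace = 0 suppresses the step-by-step log
by_aic <- step(full, direction = "backward", trace = 0)

formula(by_aic)
# p-values of the terms that the AIC search kept
summary(by_aic)$coefficients
```

The two criteria need not agree in general, since AIC can keep a term whose p-value is above 0.05.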

I should note that it seems counter-intuitive in this model that disp (displacement) is not influential in determining fuel efficiency (mpg). I haven't studied this in depth on this dataset, but one should be careful with the "always drop the highest p-value" approach and not accept its results with absolute certainty. Caveat emptor.
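
One plausible explanation (not verified in depth here) is that disp is strongly correlated with predictors that stay in the model, such as wt, so its p-value in the full model says little about its importance on its own. A quick check along those lines, with `cors` and `disp_only` as illustrative names:

```r
# pairwise correlations between disp and some of the other predictors
cors <- cor(mtcars[, c("disp", "wt", "hp", "cyl")])
round(cors, 2)

# disp as the only predictor of mpg, for contrast with the full model
disp_only <- lm(mpg ~ disp, data = mtcars)
summary(disp_only)$coefficients
```

If the correlations are high and disp is clearly significant by itself, its removal reflects overlap with other variables rather than irrelevance.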

r2evans