1

I am trying to do robust multiple regression for a dataset where a few outliers don't allow me to see the underlying patterns through the usual linear models. I am using the function lmrob in the package robustbase, and I was surprised by the number of significant relationships that I found. I decided to try the method with random data, this is the code:

library(robustbase)
set.seed(4)
ax<-data.frame(a1=rnorm(20,3),
               a2=rnorm(20,5),
               a3=rnorm(20,4),
               a4=rnorm(20,6),
               a5=rnorm(20,2))
 
axm<-lmrob(a1~a2*a3*a4*a5,data=ax)
summary(axm)

And the output:

        Call:
lmrob(formula = a1 ~ a2 * a3 * a4 * a5, data = ax)
 \--> method = "MM"
Residuals:
         1          2          3          4          5          6          7          8          9         10         11         12         13 
-34.740270  -0.049493  -0.044379   0.002770   0.219825   0.041285   0.156152  -0.072825   0.034824  -0.014757  -0.088263  -0.185045  -0.079679 
        14         15         16         17         18         19         20 
 -0.045121  -0.007576   0.008813   0.010451   0.015716   0.060781   0.040187 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.5907    94.0095  -12.35 0.000247 ***
a2            205.6910    15.8689   12.96 0.000204 ***
a3            327.9787    24.2161   13.54 0.000172 ***
a4            193.2384    15.7300   12.29 0.000252 ***
a5            734.2203    49.8960   14.71 0.000124 ***
a2:a3         -57.6229     4.0533  -14.22 0.000142 ***
a2:a4         -33.5644     2.6130  -12.85 0.000212 ***
a3:a4         -54.1622     4.0438  -13.39 0.000180 ***
a2:a5        -138.8395     9.2697  -14.98 0.000116 ***
a3:a5        -198.4961    12.3168  -16.12 8.67e-05 ***
a4:a5        -123.0895     8.2792  -14.87 0.000119 ***
a2:a3:a4        9.3344     0.6659   14.02 0.000150 ***
a2:a3:a5       37.1371     2.2502   16.50 7.89e-05 ***
a2:a4:a5       23.0014     1.5152   15.18 0.000110 ***
a3:a4:a5       32.9766     2.0388   16.18 8.55e-05 ***
a2:a3:a4:a5    -6.0817     0.3660  -16.62 7.68e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Robust residual standard error: 0.4039 
Multiple R-squared:  0.9861,    Adjusted R-squared:  0.934 
Convergence in 5 IRWLS iterations

Robustness weights: 
 observation 1 is an outlier with |weight| = 0 ( < 0.005); 
 9 weights are ~= 1. The remaining 10 ones are
     2      3      5      7      8     11     12     13     14     19 
0.9986 0.9989 0.9732 0.9864 0.9970 0.9957 0.9810 0.9965 0.9989 0.9979 
Algorithmic parameters: 
       tuning.chi                bb        tuning.psi        refine.tol           rel.tol         scale.tol         solve.tol       eps.outlier 
        1.548e+00         5.000e-01         4.685e+00         1.000e-07         1.000e-07         1.000e-10         1.000e-07         5.000e-03 
            eps.x warn.limit.reject warn.limit.meanrw 
        1.150e-09         5.000e-01         5.000e-01 
     nResample         max.it       best.r.s       k.fast.s          k.max    maxit.scale      trace.lev            mts     compute.rd 
           500             50              2              1            200            200              0           1000              0 
fast.s.large.n 
          2000 
                  psi           subsampling                   cov compute.outlier.stats 
           "bisquare"         "nonsingular"         ".vcov.avar1"                  "SM" 
seed : int(0) 

According to this, I understand that the other random variables are related to the first one, and have high predictive power over it, which makes no sense.

What is happening here? I am doing the regression wrong?

Edit: I put a seed for which the p values are extremely low for replicability.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Antón
  • 112
  • 1
  • 8
  • Running it a few more times you'll see you will get very different results. This is pure overfitting (it doesn't help that you have included all possible interactions). – user2974951 Jun 07 '22 at 08:42
  • @user2974951 yeah, I noticed it changes, usually ciclying 2 or 3 different options. In the real model with my data I only test the interaction of 1 variable with the rest, but the results look like overfitting anyway. Is this function useful at all to know what variables have a significant effect? – Antón Jun 07 '22 at 08:55
  • I should mention that the model that you are showing us (and the model results) come from a regular linear model, not from lmrob. – user2974951 Jun 07 '22 at 08:58
  • The probability of your code giving output with such low p-values is very low. – Roland Jun 07 '22 at 09:08
  • @user2974951 oh, my mistake, the code was wrong because I tried afterwards with lm() and copied the wrong one, but the output is actually from lmrob(). With lm the p values are much higher, around 0.7 – Antón Jun 07 '22 at 09:13
  • @Roland I've run the code a few times, and almost always most variables and interactions are significative, although not always this much. In any case, it happens many more times than it would be expected by chance if those p values were true – Antón Jun 07 '22 at 09:16
  • I've run the simulation a few times, too. If all coefficients are significant, they are usually pretty close to 5 %. I don't understand what you mean by "it happens many more times than it would be expected by chance if those p values were true". We know that they are not true. – Roland Jun 07 '22 at 09:23
  • I can't replicate this. Can you (1) use `set.seed()` at the beginning of your example and (2) paste the actual `lmrob` output? – Ben Bolker Jun 07 '22 at 09:24
  • @BenBolker done! Tried a few seeds to find one with extreme results as in my first try, but most will render significant p values – Antón Jun 07 '22 at 10:36

1 Answers1

0

I think I may have found the explanation to such high p-values: turns out that the MM estimate with small sample sizes (as in my example with 20) is prone to type 1 error. One of the authors of the robustbase package has published an article proposing an alternative estimate for this cases, but I'm afraid that for my data it doesn't work so much better.

Antón
  • 112
  • 1
  • 8