I am trying to do robust multiple regression for a dataset where a few outliers don't allow me to see the underlying patterns through the usual linear models.
I am using the function lmrob
in the package robustbase
, and I was surprised by the number of significant relationships that I found. I decided to try the method with random data, this is the code:
library(robustbase)
set.seed(4)
ax<-data.frame(a1=rnorm(20,3),
a2=rnorm(20,5),
a3=rnorm(20,4),
a4=rnorm(20,6),
a5=rnorm(20,2))
axm<-lmrob(a1~a2*a3*a4*a5,data=ax)
summary(axm)
And the output:
Call:
lmrob(formula = a1 ~ a2 * a3 * a4 * a5, data = ax)
\--> method = "MM"
Residuals:
1 2 3 4 5 6 7 8 9 10 11 12 13
-34.740270 -0.049493 -0.044379 0.002770 0.219825 0.041285 0.156152 -0.072825 0.034824 -0.014757 -0.088263 -0.185045 -0.079679
14 15 16 17 18 19 20
-0.045121 -0.007576 0.008813 0.010451 0.015716 0.060781 0.040187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1160.5907 94.0095 -12.35 0.000247 ***
a2 205.6910 15.8689 12.96 0.000204 ***
a3 327.9787 24.2161 13.54 0.000172 ***
a4 193.2384 15.7300 12.29 0.000252 ***
a5 734.2203 49.8960 14.71 0.000124 ***
a2:a3 -57.6229 4.0533 -14.22 0.000142 ***
a2:a4 -33.5644 2.6130 -12.85 0.000212 ***
a3:a4 -54.1622 4.0438 -13.39 0.000180 ***
a2:a5 -138.8395 9.2697 -14.98 0.000116 ***
a3:a5 -198.4961 12.3168 -16.12 8.67e-05 ***
a4:a5 -123.0895 8.2792 -14.87 0.000119 ***
a2:a3:a4 9.3344 0.6659 14.02 0.000150 ***
a2:a3:a5 37.1371 2.2502 16.50 7.89e-05 ***
a2:a4:a5 23.0014 1.5152 15.18 0.000110 ***
a3:a4:a5 32.9766 2.0388 16.18 8.55e-05 ***
a2:a3:a4:a5 -6.0817 0.3660 -16.62 7.68e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robust residual standard error: 0.4039
Multiple R-squared: 0.9861, Adjusted R-squared: 0.934
Convergence in 5 IRWLS iterations
Robustness weights:
observation 1 is an outlier with |weight| = 0 ( < 0.005);
9 weights are ~= 1. The remaining 10 ones are
2 3 5 7 8 11 12 13 14 19
0.9986 0.9989 0.9732 0.9864 0.9970 0.9957 0.9810 0.9965 0.9989 0.9979
Algorithmic parameters:
tuning.chi bb tuning.psi refine.tol rel.tol scale.tol solve.tol eps.outlier
1.548e+00 5.000e-01 4.685e+00 1.000e-07 1.000e-07 1.000e-10 1.000e-07 5.000e-03
eps.x warn.limit.reject warn.limit.meanrw
1.150e-09 5.000e-01 5.000e-01
nResample max.it best.r.s k.fast.s k.max maxit.scale trace.lev mts compute.rd
500 50 2 1 200 200 0 1000 0
fast.s.large.n
2000
psi subsampling cov compute.outlier.stats
"bisquare" "nonsingular" ".vcov.avar1" "SM"
seed : int(0)
According to this, I understand that the other random variables are related to the first one, and have high predictive power over it, which makes no sense.
What is happening here? I am doing the regression wrong?
Edit: I put a seed for which the p values are extremely low for replicability.