2

I have this code

rm(list=ls())
N = 20000
xvar <- runif(N, -10, 10) 
e <- rnorm(N, mean=0, sd=1)
yvar <- 1 + 2*xvar + e
plot(xvar,yvar)
lmMod <- lm(yvar~xvar)
print(summary(lmMod))

I expected the coefficients to be something similar to [1,2].

Instead, with N =20000, R keeps throwing at me random numbers that are not statistically significant and don't fit the model, the $R^2$ is really low..I just don't see what I am doing wrong. Here in an example output:

Call:
lm(formula = yvar ~ xvar)

Residuals:
   Min     1Q Median     3Q    Max 
-47.23  -9.10   1.24  11.23  23.74 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.03163    0.08291   0.381  0.70286   
xvar         0.04290    0.01427   3.006  0.00265 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.73 on 19998 degrees of freedom
Multiple R-squared:  0.0009635, Adjusted R-squared:  0.0009135 
F-statistic: 19.29 on 1 and 19998 DF,  p-value: 1.131e-05

However, if I put N=200 or N=2000, it works. The coefficients resemble the real ones, and are inside two standard deviations of the real ones, and I get $R^2$ values as high as 99%, and the coefficients are all statistically significant with $p<<0.01$.

What is happening here? why increasing the number of observations worsen the regression? Is R covertly experiencing problems of numerical stability?

I am running R 3.6.0 on Kubuntu 19.04. The same problem happens also by running R on the command line using the --vanilla option.

EDIT: here is the output of sessioninfo()

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.04

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.0 tools_3.6.0
robertspierre
  • 3,218
  • 2
  • 31
  • 46
  • Works as expected to me. Try running the code on a clean session, though I think just rerunning *all of it* should also suffice. – Julius Vainora May 25 '19 at 13:59
  • @JuliusVainora I have tried a clean session, the command line with the --vanilla option, nothing. I don't know what is going on – robertspierre May 25 '19 at 14:00
  • Please include also the output of `sessionInfo()`. Numerical issues are hard to believe since 20000 is not that much. – Julius Vainora May 25 '19 at 14:02
  • @JuliusVainora done – robertspierre May 25 '19 at 14:04
  • Do densities/histograms of `xvar` and `e` and `plot(xvar, yvar)` look as they should (so that the problem is in `lm`)? – Julius Vainora May 25 '19 at 14:07
  • 1
    @JuliusVainora yes, it was due to Intel MKL. Uninstalling Intel MKL and using OpenBLAS instead solved the problem. That is incredible, isn't it? (p.s. I had the idea when I saw the output of the sessioninfo() that you suggested. Thanks!) – robertspierre May 25 '19 at 14:08

2 Answers2

1

It was due to Intel MKL. Uninstalling Intel MKL and using OpenBLAS instead solved the problem.

robertspierre
  • 3,218
  • 2
  • 31
  • 46
  • Did you find a solution to fix the issue with MKL? With Ubuntu 19.10 and R 3.6.1 I get exactly the same behavior you reported here. – AlefSin Feb 05 '20 at 14:17
  • @AlefSin no unfortunately not. I wanted to report it as a bug but Intel's bug reporting system is messed up – robertspierre Feb 05 '20 at 15:02
1

As noted in my previous answer, this was due to using Intel's MKL as the BLAS backend.

However, it seems fixed with the current Intel MKL, as this code shows:

library(flexiblas)

mkl <- "$HOME/intel/oneapi/mkl/latest/lib/intel64/libmkl_rt.so"
idx <- flexiblas_load_backend(mkl)
flexiblas_switch(idx)
sessionInfo()
print(flexiblas_current_backend())

rm(list=ls())
N = 100000
xvar <- runif(N, -10, 10) 
e <- rnorm(N, mean=0, sd=1)
yvar <- 1 + 2*xvar + e
plot(xvar,yvar)
lmMod <- lm(yvar~xvar)
print(summary(lmMod))

The correct coefficients are returned.

robertspierre
  • 3,218
  • 2
  • 31
  • 46