4

I'm trying to rewrite a forecasting model (originally in Stata) in Python (using pandas.stats.api.ols), and I've run into an issue with linear regression: the coefficients and intercept computed by pandas do not match those from Stata.

Investigation suggests the root cause might be that the values of the explanatory variables are very large. I base this suspicion on the findings below:

1) I created a simple DataFrame in Python and ran linear regression with it:

from pandas.stats.api import ols
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 21],
                   "B": [20, 30, 10, 40, 50, 98],
                   "C": [32, 234, 23, 23, 31, 21],
                   "D": [12, 28, 12, 98, 51, 87],
                   "E": [1, 8, 12, 9, 12, 91]})
ols(y=df['A'], x=df[['B', 'C', 'D', 'E']])

and the summary for the LR is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   5

R-squared:         0.4627
Adj R-squared:    -1.6865

Rmse:             23.9493

F-stat (4, 1):     0.2153, p-value:     0.9026

Degrees of Freedom: model 4, resid 1

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             C    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             D     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             E    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------
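
For reference, the same fit can also be expressed with statsmodels (a rough sketch, not what I originally ran):

import statsmodels.api as sm

# OLS with an explicit intercept column, analogous to the pandas ols call
X = sm.add_constant(df[['B', 'C', 'D', 'E']])
print(sm.OLS(df['A'], X).fit().summary())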

I saved this DataFrame to a Stata .dta file and ran the regression in Stata as:

 use "/tmp/lr.dta", clear
 reg A B C D E
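
For completeness, the export from pandas was along these lines (a sketch; the write_index argument is my assumption):

df.to_stata("/tmp/lr.dta", write_index=False)  # keep the index out of the .dta file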

The result is the same:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           C |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           D |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           E |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------

I tried this in R, and got the same result.

2) However, if I make the values of the explanatory variables very large in Python:

df = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
# each regressor is a large power of A, so the columns span
# dozens of orders of magnitude
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)

I've made sure all the columns are float64 here:

df.dtypes
A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object

The result I got is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   2

R-squared:        -0.7223
Adj R-squared:    -1.1528

Rmse:             21.4390

F-stat (4, 4):    -1.6775, p-value:     1.0000

Degrees of Freedom: model 1, resid 4

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             C     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             D     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             E     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------

But in Stata, I got a very different result:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           C |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           D |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           E |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------

And the result in R aligns with Stata:

lm(formula = A ~ B + C + D + E, data = stata)

Residuals:
         1          2          3          4          5          6 
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
B           -1.449e-45  2.322e-46  -6.238    0.101  
C            2.945e-06  3.674e-07   8.015    0.079 .
D           -3.855e-21  6.106e-22  -6.313    0.100  
E            4.919e-37  7.879e-38   6.243    0.101  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared:  0.9989,  Adjusted R-squared:  0.9947 
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864

Hence, it appears to me that pandas has some issue here. Could anyone please advise?

Jackie Wong
  • Look closely at this: `df['B'] = pow(df['A'], 30)`: it replaces `B` with `A**30`. There are similar lines for `C`, `D`, and `E`, replacing the original data for each variable with `A` raised to various powers. Is that a typo? Did you instead want the new `B` to be the old `B` raised to the 30th power? What did you use in Stata and R, and does it match? – Paul Jul 18 '15 at 08:04
  • @Paul No, it's not a typo. And as I've mentioned, I saved the data to a .dta file in Python and loaded it in R and Stata, so the data match. – Jackie Wong Jul 18 '15 at 15:48
  • I used sklearn.linear_model.LinearRegression instead, as it supports normalization, and the result now matches R and Stata. – Jackie Wong Jul 21 '15 at 17:11
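
For completeness, a minimal sketch of that sklearn workaround, using df from the second example. Current scikit-learn (1.2+) has dropped the old normalize flag, so the scaling here is explicit and is my assumption, not the commenter's exact code:

from sklearn.linear_model import LinearRegression

# scale each regressor by its maximum absolute value so all columns
# live on comparable magnitudes (explicit replacement for the
# removed normalize=True option)
X = df[['B', 'C', 'D', 'E']]
scale = X.abs().max()
reg = LinearRegression().fit(X / scale, df['A'])

# map the coefficients back to the original scale
print(reg.coef_ / scale.values)
print(reg.intercept_)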

1 Answer

2

I think this is because of a relative-precision issue in Python (not just in Python; most other programming languages, such as C++, behave the same way). np.finfo(float).eps gives 2.2204460492503131e-16, so anything smaller than eps * max_value_of_your_data is treated essentially as 0 in primitive operations like + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100 / 1e117 = 1e-17 < eps. Now look at your data.
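
A quick demonstration of that absorption effect:

import numpy as np

# machine epsilon for float64: the relative spacing between 1.0
# and the next representable float
print(np.finfo(float).eps)        # 2.220446049250313e-16

# a term roughly 1e-17 times smaller than the other operand is
# absorbed completely by the addition
print(1e117 + 1e100 == 1e117)     # True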

# your data
# =======================
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33

When relative precision is taken into account,

# ===================================================
import numpy as np

np.finfo(float).eps # 2.2204460492503131e-16

df[df < df.max().max()*np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

So, at this relative precision, there is effectively no variation left in y (A), and that's why statsmodels returns all-zero coefficients. As a reminder, it's always good practice to normalize your data before running a regression; see the sketch below.
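
A minimal sketch of that normalization step, using statsmodels on the data from the question. Scaling each regressor by its maximum absolute value makes the problem much better conditioned; whether it matches Stata to the last digit will depend on the solver:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
for col, p in zip(['B', 'C', 'D', 'E'], [30, 5, 15, 25]):
    df[col] = pow(df['A'], p)

# bring every column to a comparable magnitude before fitting
X = df[['B', 'C', 'D', 'E']]
scale = X.abs().max()
fit = sm.OLS(df['A'], sm.add_constant(X / scale)).fit()

# map the coefficients back to the original scale of the data
print(fit.params[['B', 'C', 'D', 'E']] / scale)
print(fit.params['const'])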

Jianxun Li
  • RE: "it's always a good practice to normalize your data first before running regression": see http://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize – Brandon Bertelsen Jul 18 '15 at 14:51
  • @jianxun-li: Thanks a lot! That's exactly what I suspected. Do R and Stata have special handling for floating-point numbers, or do they normalize implicitly before running the regression? – Jackie Wong Jul 18 '15 at 15:50
  • @brandon-bertelsen: Appreciate the link. I'll try to understand it. Sorry, I'm not a stats guy and have little understanding of this. – Jackie Wong Jul 18 '15 at 15:53
  • @JackieWong Not sure how those domain-specific stats packages handle this sort of relative-precision issue. But as you mentioned, first normalizing the RHS variables and then scaling back (along with the standard errors) after estimation sounds like a valid approach. – Jianxun Li Jul 18 '15 at 15:55