I'm trying to rewrite a forecasting model (originally written in Stata) in Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the coefficients and intercept computed by pandas do not match those from Stata.
My investigation suggests that the root cause might be that the values of the independent variables are very big. I have this suspicion based on the findings below:
1) I created a simple DataFrame in Python and ran a linear regression on it:
from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])
and the regression summary is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 5
R-squared: 0.4627
Adj R-squared: -1.6865
Rmse: 23.9493
F-stat (4, 1): 0.2153, p-value: 0.9026
Degrees of Freedom: model 4, resid 1
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.3212 1.1176 0.29 0.8218 -1.8693 2.5117
C -0.0488 0.1361 -0.36 0.7806 -0.3155 0.2178
D 0.1512 0.4893 0.31 0.8092 -0.8077 1.1101
E -0.4508 0.8268 -0.55 0.6822 -2.0713 1.1697
intercept 20.9222 23.6280 0.89 0.5386 -25.3887 67.2331
---------------------------------End of Summary---------------------------------
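For reference, the same fit can be reproduced with plain numpy, independently of pandas; a minimal sketch that should give the same coefficients as above (assuming a reasonably recent numpy):
import numpy as np
import pandas as pd

# Same data as above
df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 21],
                   "B": [20, 30, 10, 40, 50, 98],
                   "C": [32, 234, 23, 23, 31, 21],
                   "D": [12, 28, 12, 98, 51, 87],
                   "E": [1, 8, 12, 9, 12, 91]})

# Design matrix: regressors B..E plus an explicit intercept column
X = np.column_stack([df[["B", "C", "D", "E"]].values, np.ones(len(df))])
y = df["A"].values

# SVD-based least squares; coefficients come back in the order B, C, D, E, intercept
coefs, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(coefs, rank)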
I saved this DataFrame to a Stata .dta file and ran the regression in Stata as:
use "/tmp/lr.dta", clear
reg A B C D E
The result is the same:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 0.22
Model | 493.929019 4 123.482255 Prob > F = 0.9026
Residual | 573.570981 1 573.570981 R-squared = 0.4627
-------------+------------------------------ Adj R-squared = -1.6865
Total | 1067.5 5 213.5 Root MSE = 23.949
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | .3211939 1.117591 0.29 0.822 -13.87914 14.52153
C | -.0488429 .1360552 -0.36 0.781 -1.777589 1.679903
D | .1512067 .4892539 0.31 0.809 -6.065353 6.367766
E | -.4508122 .8267897 -0.55 0.682 -10.95617 10.05455
_cons | 20.9222 23.62799 0.89 0.539 -279.2998 321.1442
------------------------------------------------------------------------------
I tried this in R, and got the same result.
2) However, when I increased the values of the independent variables in Python:
df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)
I've made sure all the columns use float64 here:
df.dtypes
A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object
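Just to show the magnitudes this creates (a plain check on the data, nothing regression-specific):
print(df[["B", "C", "D", "E"]].max())
# B is A**30, so its largest entry is 50**30, roughly 9.3e50; together with the
# intercept column of ones, the design matrix spans about 50 orders of magnitude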
Running the same ols(y=df['A'], x=df[['B', 'C', 'D', 'E']]) call on this DataFrame, the result I got is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 2
R-squared: -0.7223
Adj R-squared: -1.1528
Rmse: 21.4390
F-stat (4, 4): -1.6775, p-value: 1.0000
Degrees of Freedom: model 1, resid 4
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B -0.0000 0.0000 -0.00 0.9973 -0.0000 0.0000
C 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
D 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
E 0.0000 0.0000 0.00 0.9975 -0.0000 0.0000
intercept 0.0000 21.7485 0.00 1.0000 -42.6271 42.6271
---------------------------------End of Summary---------------------------------
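My guess is that a design matrix with entries this large is extremely ill-conditioned, and that the pandas solver treats it as numerically rank-deficient and drops columns, which might explain why the reported degrees of freedom fall from 5 to 2. A quick diagnostic with plain numpy (my own check, not part of the pandas call), reusing the df built above:
import numpy as np

# Condition number of the design matrix (regressors plus intercept column);
# anything near or above ~1e16 means float64 least squares cannot resolve the
# problem, and here I would expect an astronomically larger value (or even inf)
X = np.column_stack([df[["B", "C", "D", "E"]].values, np.ones(len(df))])
print(np.linalg.cond(X))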
But in Stata, I got a very different result:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 237.35
Model | 1066.37679 4 266.594196 Prob > F = 0.0486
Residual | 1.1232144 1 1.1232144 R-squared = 0.9989
-------------+------------------------------ Adj R-squared = 0.9947
Total | 1067.5 5 213.5 Root MSE = 1.0598
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | -1.45e-45 2.32e-46 -6.24 0.101 -4.40e-45 1.50e-45
C | 2.94e-06 3.67e-07 8.01 0.079 -1.72e-06 7.61e-06
D | -3.86e-21 6.11e-22 -6.31 0.100 -1.16e-20 3.90e-21
E | 4.92e-37 7.88e-38 6.24 0.101 -5.09e-37 1.49e-36
_cons | 9.881129 1.07512 9.19 0.069 -3.779564 23.54182
------------------------------------------------------------------------------
And the result in R aligns with Stata:
lm(formula = A ~ B + C + D + E, data = stata)
Residuals:
1 2 3 4 5 6
-1.757e-01 8.211e-01 1.287e-03 -1.269e-06 1.289e-09 -6.467e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.881e+00 1.075e+00 9.191 0.069 .
B -1.449e-45 2.322e-46 -6.238 0.101
C 2.945e-06 3.674e-07 8.015 0.079 .
D -3.855e-21 6.106e-22 -6.313 0.100
E 4.919e-37 7.879e-38 6.243 0.101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared: 0.9989, Adjusted R-squared: 0.9947
F-statistic: 237.3 on 4 and 1 DF, p-value: 0.04864
Hence, it appears to me that pandas has some issue here. Could anyone please advise?
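In case it's useful, the only Python-side workaround I can think of is to rescale the regressors to a comparable magnitude before fitting and map the coefficients back afterwards. A sketch with plain numpy (I haven't verified that it reproduces the Stata/R numbers exactly):
import numpy as np
import pandas as pd

# Same data as in 2) above
df = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
for col, p in [("B", 30), ("C", 5), ("D", 15), ("E", 25)]:
    df[col] = df["A"] ** p

# Divide each regressor by its maximum so all design-matrix entries are O(1)
cols = ["B", "C", "D", "E"]
scales = df[cols].max()
X = np.column_stack([(df[cols] / scales).values, np.ones(len(df))])
y = df["A"].values

coefs, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Map back to the original scale: a coefficient b on B/s_B equals b/s_B on B
print(coefs[:-1] / scales.values)  # B, C, D, E coefficients on the original scale
print(coefs[-1])                   # intercept (unaffected by the rescaling)
print(rank)                        # how many columns the solver actually kept
In exact arithmetic this rescaling doesn't change the fitted model at all; it only keeps the solver away from ~1e50-sized entries.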