
I'm trying to reproduce in Python a linear regression that I have already done in R, in order to find variables with 0 coefficients. The issue I'm running into is that the linear regression in R returns NAs for columns with low variance, while the scikit-learn regression returns a coefficient for every column. In R I identify these variables by saving the names of the coefficients that come back as NA, but I can't figure out a way to mimic this behavior in Python. The code I'm using can be found below.

R Code:

a <- c(23, 45, 546, 42, 68, 15, 47)
b <- c(1, 2, 4, 6, 34, 2, 8)
c <- c(22, 33, 44, 55, 66, 77, 88)
d <- c(1, 1, 1, 1, 1, 1, 1)
e <- c(1, 1, 1, 1, 1, 1, 1.1)
f <- c(1, 1, 1, 1, 1, 1, 1.01)
g <- c(1, 1, 1, 1, 1, 1, 1.001)

df <- data.frame(a, b, c, d, e, f, g)
var_list = c('b', 'c', 'd', 'e', 'f', 'g')

target <- df$a
reg_data <- cbind(target, df[, var_list])


if (nrow(reg_data) < length(var_list)){
  message(paste0('    WARNING: Data set is rank deficient. Result may be doubtful'))
}
reg_model <- lm(target ~ ., data = reg_data)

print(reg_model$coefficients)

#store the independent variables with 0 coefficients
zero_coef_IndepVars.v <- names(which(is.na(reg_model$coefficients)))

print(zero_coef_IndepVars.v)

Python Code:

import pandas as pd
from sklearn import linear_model

a = [23, 45, 546, 42, 68, 15, 47]
b = [1, 2, 4, 6, 34, 2, 8]
c = [22, 33, 44, 55, 66, 77, 88]
d = [1, 1, 1, 1, 1, 1, 1]
e = [1, 1, 1, 1, 1, 1, 1.1]
f = [1, 1, 1, 1, 1, 1, 1.01]
g = [1, 1, 1, 1, 1, 1, 1.001]


df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d,
                   'e': e,
                   'f': f,
                   'g': g})


var_list = ['b', 'c', 'd', 'e', 'f', 'g']

# build linear regression model and test for linear combination
target = df['a']
reg_data = pd.DataFrame()
reg_data['a'] = target
train_cols = df[var_list]


if reg_data.shape[0] < len(var_list):
    print('    WARNING: Data set is rank deficient. Result may be doubtful')

# Create linear regression object
reg_model = linear_model.LinearRegression()

# Train the model using the training sets
reg_model.fit(train_cols, reg_data['a'])

print(reg_model.coef_)

Output from R:

(Intercept)           b           c           d           e           f           g 
 537.555988   -0.669253   -1.054719          NA -356.715149          NA          NA 

> print(zero_coef_IndepVars.v)
[1] "d" "f" "g"

Output from Python:

           b             c   d               e              f            g
[-0.66925301   -1.05471932   0.   -353.1483504   -35.31483504   -3.5314835]

As you can see, the values for columns 'b', 'c', and 'e' are close to the R values, but very different for 'd', 'f', and 'g'. For this example regression, I would want to return ['d', 'f', 'g'], since their outputs are NA in R. The problem is that the sklearn linear regression returns 0 for col 'd', while it returns -35.31 for col 'f' and -3.531 for col 'g'.

Does anyone know how R decides whether to return NA or a value, and how to implement this behavior in the Python version? Knowing where the differences stem from will likely help me implement the R behavior in Python. I need the results of the Python script to match the R output exactly.

    Just noting they are `NA`, not `NaN`. – Axeman Apr 20 '17 at 16:19
  • @SinanÜnür so do you believe there is a co-linearity check in the R linear regression? I figured it was something like that, which is why I chose my data the way that I did, but I need to replicate this behavior in python. – Nizag Apr 20 '17 at 16:22

2 Answers


It's a difference in implementation. lm in R uses underlying C code that is based on a QR decomposition. The model matrix is decomposed into an orthogonal matrix Q and an upper triangular matrix R. This causes what others have called "a check on collinearity". R doesn't check for that explicitly; the nature of the QR decomposition ensures that the least collinear variables get "priority" in the fitting algorithm.

More info on QR decomposition in the context of linear regression: https://www.stat.wisc.edu/~larget/math496/qr.html

The code from sklearn is basically a wrapper around numpy.linalg.lstsq, which minimizes the squared Euclidean norm of the residuals: if your model is Y = AX, it minimizes ||Y - AX||^2. This is a different (and computationally less stable) algorithm, and it doesn't have the nice side effect of the QR decomposition, namely that linearly dependent columns get flagged and dropped.
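As a quick illustration (a sketch reusing the df and var_list from the question), lstsq happily returns a finite, minimum-norm solution for a rank-deficient design instead of marking anything as NA (the numbers need not match the scikit-learn output above exactly, since LinearRegression handles the intercept by centering the data first):

import numpy as np

# build the design matrix with an explicit intercept column, like lm() does
A = np.column_stack([np.ones(len(df)), df[var_list].values])
coefs, resid, rank, sv = np.linalg.lstsq(A, df['a'].values, rcond=None)

print(rank)   # 4, although A has 7 columns: the design is rank deficient
print(coefs)  # 7 finite coefficients, none of them NA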

Personal note: if you want robust fitting of models in a proven and tested computational framework and insist on using Python, look for linear regression implementations that are based on QR or SVD. The packages scikit-learn or statsmodels (still in beta as of 22 April 2017) should get you there.
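For the concrete problem in the question (recovering the names that lm reports as NA), a pivoted QR can be run directly. The sketch below is illustrative only: the tolerance is borrowed from lm.fit's default, and scipy's column-pivoting strategy is not identical to the modified LINPACK routine behind lm, so the flagged set is not guaranteed to match R in every edge case.

import numpy as np
from scipy.linalg import qr

# design matrix with the intercept first, mirroring lm(target ~ .)
X = np.column_stack([np.ones(len(df)), df[var_list].values])
Q, R, piv = qr(X, pivoting=True)              # QR with column pivoting

tol = 1e-7                                    # same order as lm.fit's default tolerance
diag = np.abs(np.diag(R))
rank = int((diag > tol * diag[0]).sum())      # columns pivoted beyond this point are (near-)dependent

cols = ['(Intercept)'] + var_list
aliased = [cols[i] for i in piv[rank:]]       # analogue of names(which(is.na(reg_model$coefficients)))
print(aliased)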

Joris Meys
  • Thanks guys, this is super helpful! I'll look through and see if there is anything existing in python that uses the QR decomposition implementation, or stick with R if that doesn't work out. – Nizag Apr 20 '17 at 17:01
  • Personal note: If you want some variable dropped randomly from your regression, then use R. If you want a SVD/pinv regularized solution, then use Python scikit-learn or statsmodels. If you want neither, then clean your data and choose your variables yourself. – Josef Apr 21 '17 at 21:09
  • @user333700 There is nothing random about a QR decomposition, as shown by the question. But thank you anyway for a pointer to an SVD based method. And for the record: half of the research at our department is done in Python. Python is a great language. But for a simple linear regression, we have now mentioned actually 5 packages already (numpy, scipy, sklearn, scikit-learn and statsmodels). I'll stick with R in my statistical computing class: less explaining to do, and the standard tool is a stable one. Each to his own I guess. – Joris Meys Apr 22 '17 at 17:33
  • see also answers in http://stackoverflow.com/questions/40935624/differences-in-linear-regression-in-r-and-python/40937228 for more details. (It is very unlikely that statsmodels will ever get variable selection by pivoting QR, even though it could now be implemented with scipy.linalg.) – Josef Apr 22 '17 at 19:18
  • I played a bit with the example: If we change `d` by adding 1e-6, i.e. `d <- c(1, 1, 1, 1, 1, 1, 1 + 1e-6)`, then `d` is kept and `e` is dropped. If I add 1e-7 instead of 1e-6, then `d` is still dropped as in the original case. This looks pretty "random" (fragile) to me for repeated use cases with slightly different datasets. – Josef Apr 22 '17 at 19:25
  • I forgot to add in last comment: version R x64 3.2.0 on Windows – Josef Apr 22 '17 at 19:38
  • @user333700 If you calculate the QR matrices, you see why this happens. That might look random, but it isn't. And for the record: linear regression is NOT a suitable technique for this data, regardless of the underlying computing algorithm. I understand you want to defend Python. I'm not going to claim R code is easier to read than Python code. Because it isn't. I'm also not going to claim Python is better/more stable for statistics than R. Because it isn't. That doesn't mean it's useless. I'll take numpy and scipy over Java, Ruby, or Excel any day. – Joris Meys Apr 23 '17 at 13:14
  • I'm not defending Python per se. I wanted to point out that there are intentional design decisions where packages like statsmodels, scikit-learn or other ones like Stata differ from R. (This is not necessarily related to how old and stable a package is. Nevertheless, statsmodels does have some models that do not yet work very well in difficult or edge cases and that are not yet stable in that sense.) – Josef Apr 23 '17 at 13:41
  • @user333700 and I thank you for that. You might not have noticed, but I added that information to my answer. Here's to science ;-) – Joris Meys Apr 23 '17 at 14:27

I guess there is not enough data. This is the result from statsmodels:

import statsmodels.formula.api as smf
lm = smf.ols(formula='a ~ b + c + d + e + f + g', data=df).fit()
lm.summary()

gives:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                 -0.923
Method:                 Least Squares   F-statistic:                   0.03993
Date:                Fri, 21 Apr 2017   Prob (F-statistic):              0.987
Time:                        22:29:16   Log-Likelihood:                -46.059
No. Observations:                   7   AIC:                             100.1
Df Residuals:                       3   BIC:                             99.90
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    151.5350   1065.536      0.142      0.896    -3239.476  3542.545
b             -0.6693     10.324     -0.065      0.952      -33.526    32.188
c             -1.0547      6.412     -0.164      0.880      -21.462    19.352
d            151.5350   1065.536      0.142      0.896    -3239.476  3542.545
e           -368.1353   3862.592     -0.095      0.930    -1.27e+04  1.19e+04
f             99.5679    574.110      0.173      0.873    -1727.506  1926.642
g            146.3383   1016.341      0.144      0.895    -3088.111  3380.788
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.447
Prob(Omnibus):                    nan   Jarque-Bera (JB):                4.545
Skew:                           1.797   Prob(JB):                        0.103
Kurtosis:                       4.632   Cond. No.                     1.34e+18
==============================================================================

OLS gives several clues that this linear problem is ill-conditioned.
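As a small follow-up sketch (reusing the lm results object from above), the same clues can be read off programmatically instead of from the summary table:

import numpy as np

X = lm.model.exog                   # design matrix, including the intercept column
print(np.linalg.matrix_rank(X))     # 4, although X has 7 columns
print(lm.condition_number)          # ~1.3e+18, flagging a numerically singular design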

cast42