
Is there a function in scipy for doing robust linear regression?

My current solution:

from scipy import stats

slope, intercept, r_value, p_value, std_err = stats.linregress(income, exp)

walter

1 Answer


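You can use RANSAC (RANdom SAmple Consensus), available in scikit-learn as RANSACRegressor. It repeatedly fits the model on random subsets of the data and keeps the fit that agrees with the most points, which gives a parameter estimate that is robust to outliers. If you need p-values etc., statsmodels is probably the better choice. Below is some example data with a few outliers: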

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import linear_model
import matplotlib.pyplot as plt

np.random.seed(555)

df = pd.DataFrame({"X":np.random.uniform(0,4,50)})
df['y'] = df["X"]*2 + np.random.normal(0,1,50)
# Turn the first five points into outliers (.loc avoids chained assignment)
df.loc[:4, 'y'] = df.loc[:4, 'X']*4 + np.random.normal(3,1,5)

To compare, we fit ordinary linear regression, RANSAC, and a statsmodels robust linear model (RLM):

lr = linear_model.LinearRegression()
lr.fit(df[['X']], df['y'])

ransac = linear_model.RANSACRegressor()
ransac.fit(df[['X']], df['y'])

rlm = sm.RLM(df[['y']], sm.add_constant(df[['X']]), M=sm.robust.norms.HuberT())
rlm_results = rlm.fit()

Now plot the results. You can see that the ordinary linear regression line is pulled away by the outliers, while the RANSAC line is much less affected:

line_X = np.arange(df.X.min(), df.X.max(),0.2).reshape(-1,1)
line_y = lr.predict(line_X)
line_y_ransac = ransac.predict(line_X)
line_y_rlm = rlm_results.predict(sm.add_constant(line_X))

df.plot.scatter(x='X',y='y')
plt.plot(line_X,line_y,c="g",label="linear_regression")
plt.plot(line_X,line_y_ransac,c="k",label="ransac_regression")
plt.plot(line_X,line_y_rlm,c="y",label="rlm_regression")
plt.legend(loc='lower right')

Also note that the RANSAC parameters sometimes need to be tweaked; check out this discussion.
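As a rough sketch of what such tweaking can look like: the values below (`min_samples=10`, `residual_threshold=2.0`, `max_trials=200`) are illustrative assumptions, not recommendations, and the data is regenerated here with NumPy's `default_rng` so the snippet runs on its own. By default scikit-learn sets `residual_threshold` to the median absolute deviation of `y`, which may be too strict or too loose for your data.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data similar to the example above (assumed, not the same draws)
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(50, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, 50)
y[:5] = 4 * X[:5, 0] + rng.normal(3, 1, 5)  # a few outliers

ransac = RANSACRegressor(
    min_samples=10,          # points drawn for each candidate fit
    residual_threshold=2.0,  # max absolute residual to count as an inlier
    max_trials=200,          # number of random subsets to try
    random_state=0,          # make the random sampling reproducible
)
ransac.fit(X, y)

slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_
n_inliers = int(ransac.inlier_mask_.sum())
print(slope, intercept, n_inliers)
```

Checking how `n_inliers` changes as you vary `residual_threshold` is a quick way to see whether the threshold is doing what you expect.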

Only statsmodels provides p-values, standard errors, etc.:

rlm_results.summary()
StupidWolf
  • Thank you for your answer! Is there a way to get slope, intercept, r_value, p_value, std_err etc. easily from the ransac object, or do I need to write a function for that? – walter Nov 17 '20 at 23:27
  • You can do `ransac.estimator_.coef_` to get the coefficient, and `ransac.estimator_.intercept_` to get the intercept – StupidWolf Nov 17 '20 at 23:31
  • p value and standard error, ok I think you need to use statsmodels.. I can write an extension to the answer – StupidWolf Nov 17 '20 at 23:32
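Following up on the comments: one possible workaround, if you want `linregress`-style output while still using RANSAC to discard outliers, is to rerun `stats.linregress` on the points RANSAC flagged as inliers (`ransac.inlier_mask_`). This is only a sketch on regenerated data, and the resulting p-value and standard error ignore the outlier-selection step, so treat them as optimistic.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import RANSACRegressor

# Synthetic data similar to the example above (assumed, not the same draws)
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 50)
y = 2 * x + rng.normal(0, 1, 50)
y[:5] = 4 * x[:5] + rng.normal(3, 1, 5)  # a few outliers

ransac = RANSACRegressor(random_state=0).fit(x.reshape(-1, 1), y)
mask = ransac.inlier_mask_  # boolean array, True for inliers

# Ordinary least squares on the inliers only
fit = stats.linregress(x[mask], y[mask])
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)
```

The slope here should match `ransac.estimator_.coef_[0]`, since the final RANSAC estimator is itself a least-squares fit on the same inliers.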