Is there a function in scipy for doing robust linear regression?
My current solution:
slope, intercept, r_value, p_value, std_err = stats.linregress(income, exp)
You can use RANSAC (RANdom SAmple Consensus) from scikit-learn, which repeatedly fits the model on random subsets of the data and keeps the fit that the most points agree with, giving parameter estimates that are robust to outliers. If you need p-values etc., statsmodels is probably the better choice. Below is an example dataset with some outliers:
import pandas as pd
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
import matplotlib.pyplot as plt
np.random.seed(555)
df = pd.DataFrame({"X":np.random.uniform(0,4,50)})
df['y'] = df["X"]*2 + np.random.normal(0,1,50)
# make the first five rows outliers (.loc avoids chained assignment)
df.loc[:4, 'y'] = df.loc[:4, 'X']*4 + np.random.normal(3,1,5)
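Before fitting anything, it may help to see roughly what RANSAC does internally. The following is only a minimal sketch of the idea in plain Python, not scikit-learn's implementation: the helper names (`fit_line`, `least_squares`, `ransac_line`), the threshold, the number of trials and the toy data are all assumptions made for illustration.

```python
import random

def fit_line(p, q):
    """Exact line through two points; returns (slope, intercept)."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def least_squares(points):
    """Ordinary least-squares line through a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def ransac_line(points, threshold, n_trials=200, seed=0):
    """Keep the largest consensus set found, then refit on it."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        p, q = rng.sample(points, 2)      # minimal random sample
        if p[0] == q[0]:
            continue                      # vertical line, skip this sample
        slope, intercept = fit_line(p, q)
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return least_squares(best_inliers), best_inliers

# Toy data (hypothetical): 40 points exactly on y = 2x,
# plus 5 outliers shifted up by 10.
points = [(i * 0.1, 2 * i * 0.1) for i in range(40)]
points += [(4.0 + i * 0.1, 2 * (4.0 + i * 0.1) + 10) for i in range(5)]

(slope, intercept), inliers = ransac_line(points, threshold=0.5)
ols_slope, ols_intercept = least_squares(points)
print("RANSAC slope:", slope, "inliers:", len(inliers))
print("OLS slope:   ", ols_slope)
```

On this toy data the plain least-squares slope is dragged upward by the five shifted points, while the consensus-based fit recovers the slope of the 40 clean points.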
To compare, we fit ordinary linear regression, RANSAC, and a robust linear model (RLM) from statsmodels:
lr = linear_model.LinearRegression()
lr.fit(df[['X']], df['y'])
ransac = linear_model.RANSACRegressor()
ransac.fit(df[['X']], df['y'])
rlm = sm.RLM(df[['y']], sm.add_constant(df[['X']]), M=sm.robust.norms.HuberT())
rlm_results = rlm.fit()
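If you want to compare the fits numerically rather than by eye, the fitted slopes can be read off the estimators. The snippet below is a self-contained sketch on a deliberately noise-free toy dataset (an assumption, chosen so the numbers are easy to check), not the random data above. `coef_`, `estimator_` and `inlier_mask_` are attributes of the fitted scikit-learn objects.

```python
import numpy as np
from sklearn import linear_model

# Toy data (hypothetical): y = 2x exactly, last five points shifted up by 10.
X = np.linspace(0, 4, 50).reshape(-1, 1)
y = 2 * X.ravel()
y[-5:] += 10

lr = linear_model.LinearRegression().fit(X, y)
ransac = linear_model.RANSACRegressor(random_state=0).fit(X, y)

ols_slope = lr.coef_[0]
# RANSAC refits its base estimator on the inliers; that fit lives in estimator_.
ransac_slope = ransac.estimator_.coef_[0]
n_inliers = int(ransac.inlier_mask_.sum())

print("OLS slope:   ", ols_slope)
print("RANSAC slope:", ransac_slope, "using", n_inliers, "inliers")
```

For the statsmodels fit, the corresponding numbers are in `rlm_results.params`.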
Now plot the results. You can see that the ordinary linear regression line is pulled away by the outliers, whereas RANSAC is affected much less:
line_X = np.arange(df.X.min(), df.X.max(),0.2).reshape(-1,1)
line_y = lr.predict(line_X)
line_y_ransac = ransac.predict(line_X)
line_y_rlm = rlm_results.predict(sm.add_constant(line_X))
df.plot.scatter(x='X',y='y')
plt.plot(line_X,line_y,c="g",label="linear_regression")
plt.plot(line_X,line_y_ransac,c="k",label="ransac_regression")
plt.plot(line_X,line_y_rlm,c="y",label="rlm_regression")
plt.legend(loc='lower right')
Also note that the parameters sometimes need to be tweaked; check out this discussion.
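As a rough illustration of what "tweaking" can look like, `RANSACRegressor` accepts (among others) `residual_threshold`, `min_samples`, `max_trials` and `random_state`. The values below are arbitrary choices for a noise-free toy dataset, so treat this as a sketch rather than recommended settings.

```python
import numpy as np
from sklearn import linear_model

# Toy data (hypothetical): y = 2x exactly, last five points shifted up by 10.
X = np.linspace(0, 4, 50).reshape(-1, 1)
y = 2 * X.ravel()
y[-5:] += 10

ransac = linear_model.RANSACRegressor(
    residual_threshold=0.5,  # max absolute residual for a point to count as an inlier
    min_samples=2,           # points drawn per random sample (2 define a line)
    max_trials=200,          # upper bound on the number of random samples tried
    random_state=0,          # make the random sampling reproducible
)
ransac.fit(X, y)

tuned_slope = ransac.estimator_.coef_[0]
tuned_inliers = int(ransac.inlier_mask_.sum())
print("slope:", tuned_slope, "inliers:", tuned_inliers)
```

If `residual_threshold` is left unset, scikit-learn uses the median absolute deviation of `y`, which may be too loose or too tight for a given dataset.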
Only statsmodels provides p-values, standard errors, etc.:
rlm_results.summary()