
sklearn.linear_model.LinearRegression has a parameter fit_intercept that can be set to True or False. If I set it to True, does it add an additional intercept column of all 1's to the dataset? If I already have a dataset with a column of 1's, does fit_intercept=False account for that, or does it force a zero-intercept model?

Update: It seems people do not get my question. Suppose I already have a column of 1's in my dataset of predictors (the 1's are for the intercept). Then:

  1. if I use fit_intercept=False, will it remove the column of 1's?

  2. if I use fit_intercept=True, will it add an EXTRA column of 1's?

user321627
  • Please have a look at [this question](https://stats.stackexchange.com/questions/102709/when-forcing-intercept-of-0-in-linear-regression-is-acceptable-advisable), [this](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model/32518#32518) and also [this](https://stackoverflow.com/questions/24393518/python-sklearn-linear-model-linearregression-working-weird). – Vivek Kumar Oct 17 '17 at 01:38
  • My question is unrelated to all 3, I've updated it accordingly for more clarification. – user321627 Oct 17 '17 at 03:09

1 Answer


fit_intercept=False forces the y-intercept to 0. With fit_intercept=True, the y-intercept is estimated from the data along with the coefficients, so it is determined by the line of best fit.

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

bias = 100

X = np.arange(1000).reshape(-1,1)
y_true = np.ravel(X.dot(0.3) + bias)
noise = np.random.normal(0, 60, 1000)
y = y_true + noise

lr_fi_true = LinearRegression(fit_intercept=True)
lr_fi_false = LinearRegression(fit_intercept=False)

lr_fi_true.fit(X, y)
lr_fi_false.fit(X, y)

print('Intercept when fit_intercept=True : {:.5f}'.format(lr_fi_true.intercept_))
print('Intercept when fit_intercept=False : {:.5f}'.format(lr_fi_false.intercept_))

lr_fi_true_yhat = np.dot(X, lr_fi_true.coef_) + lr_fi_true.intercept_
lr_fi_false_yhat = np.dot(X, lr_fi_false.coef_) + lr_fi_false.intercept_

plt.scatter(X, y, label='Actual points')
plt.plot(X, lr_fi_true_yhat, 'r--', label='fit_intercept=True')
plt.plot(X, lr_fi_false_yhat, 'r-', label='fit_intercept=False')
plt.legend()

plt.vlines(0, 0, y.max())
plt.hlines(bias, X.min(), X.max())
plt.hlines(0, X.min(), X.max())

plt.show()

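This example prints something like the following (the noise is not seeded, so the exact value of the first intercept varies from run to run):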

Intercept when fit_intercept=True : 100.32210
Intercept when fit_intercept=False : 0.00000

The plot makes clear what fit_intercept does. When fit_intercept=True, the fitted line is free to cross the y-axis wherever the data suggest (close to 100 in this example). When fit_intercept=False, the intercept is fixed at 0, so the line is forced through the origin (0, 0).

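[Figure: fit_intercept in sklearn. Scatter plot of the data with the fit_intercept=True line (dashed) and the fit_intercept=False line (solid).]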
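For ordinary least squares, fitting an intercept is mathematically the same as centering X and y, fitting through the origin, and then recovering the intercept from the means. The sketch below checks that equivalence numerically on data shaped like the example above. The seed and the names `lr_centered`, `intercept_by_hand`, and `lr_origin` are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # illustrative seed
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = 0.3 * X.ravel() + 100 + rng.normal(0, 60, size=1000)

# Reference fit: let scikit-learn estimate the intercept.
lr = LinearRegression(fit_intercept=True).fit(X, y)

# Same model by hand: center X and y, fit through the origin,
# then recover the intercept from the means.
x_mean, y_mean = X.mean(axis=0), y.mean()
lr_centered = LinearRegression(fit_intercept=False).fit(X - x_mean, y - y_mean)
intercept_by_hand = y_mean - x_mean @ lr_centered.coef_

print(np.allclose(lr.coef_, lr_centered.coef_))
print(np.isclose(lr.intercept_, intercept_by_hand))

# fit_intercept=False on the raw data is plain least squares through the origin.
slope_origin, *_ = np.linalg.lstsq(X, y, rcond=None)
lr_origin = LinearRegression(fit_intercept=False).fit(X, y)
print(np.allclose(lr_origin.coef_, slope_origin))
```

All three checks should print True, which suggests that no column of ones is needed in X for the intercept to be estimated.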


What happens if I include a column of ones or zeros and set fit_intercept to True or False?

The example below shows how to inspect this.

from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(1)
bias = 100

X = np.arange(1000).reshape(-1,1)
y_true = np.ravel(X.dot(0.3) + bias)
noise = np.random.normal(0, 60, 1000)
y = y_true + noise

# with column of ones
X_with_ones = np.hstack((np.ones((X.shape[0], 1)), X))

for b, data in ((True, X), (False, X), (True, X_with_ones), (False, X_with_ones)):
    lr = LinearRegression(fit_intercept=b)
    lr.fit(data, y)

    print(lr.intercept_, lr.coef_)

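Take-away:

# fit_intercept=True, no column of ones
104.156765787 [ 0.29634031]
# fit_intercept=False, no column of ones
0.0 [ 0.45265361]
# fit_intercept=True, with a column of ones
104.156765787 [ 0.          0.29634031]
# fit_intercept=False, with a column of ones
0.0 [ 104.15676579    0.29634031]

In other words, LinearRegression neither adds a column to X nor removes one: coef_ always has one entry per input column. With fit_intercept=True, the intercept is reported in intercept_, and an existing column of ones is redundant and gets a coefficient of 0. With fit_intercept=False, intercept_ is 0.0, and a column of ones, if present, plays the role of the intercept: its coefficient (104.157 here) is the fitted intercept.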
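The two questions in the update can also be checked directly, by comparing coefficient shapes and predictions rather than reading printed numbers. This is a minimal sketch under the same data-generating assumptions as above. The names `plain`, `origin`, `ones_true`, `ones_false`, and `zeros_false` are illustrative, and the column of zeros is added only to cover the "ones or zeros" case from the heading.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)  # illustrative seed
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = 0.3 * X.ravel() + 100 + rng.normal(0, 60, size=1000)

n = X.shape[0]
X_with_ones = np.hstack((np.ones((n, 1)), X))
X_with_zeros = np.hstack((np.zeros((n, 1)), X))

plain = LinearRegression(fit_intercept=True).fit(X, y)
origin = LinearRegression(fit_intercept=False).fit(X, y)
ones_true = LinearRegression(fit_intercept=True).fit(X_with_ones, y)
ones_false = LinearRegression(fit_intercept=False).fit(X_with_ones, y)
zeros_false = LinearRegression(fit_intercept=False).fit(X_with_zeros, y)

# coef_ has one entry per input column: no column is added or dropped.
print(plain.coef_.shape, ones_true.coef_.shape, ones_false.coef_.shape)

# With a column of ones, both settings give the same fitted line as `plain`.
print(np.allclose(plain.predict(X), ones_true.predict(X_with_ones)))
print(np.allclose(plain.predict(X), ones_false.predict(X_with_ones)))

# With fit_intercept=False, the ones column's coefficient is the intercept.
print(np.isclose(ones_false.coef_[0], plain.intercept_))

# A column of zeros cannot act as an intercept: the fit still goes
# through the origin, exactly as without that column.
print(np.allclose(origin.predict(X), zeros_false.predict(X_with_zeros)))
```

The shapes printed should be (1,), (2,) and (2,), and every remaining check should print True. So a pre-existing column of ones is harmless with either setting, while a column of zeros contributes nothing.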
Jarad
  • If I have already included a column of ones in my set of predictor columns, what happens if I fit it using True and then False? – user321627 Oct 17 '17 at 03:06
  • Is there a mistake in the picture? The dotted line should be for fit_intercept = False and the solid line should be for fit_intercept = True, right? – Huy Truong May 27 '21 at 02:22
  • @HuyTruong what makes you think that? – Jarad May 27 '21 at 19:37
  • Oh man, I'm so sorry. My bad. I glanced at the plot quickly and thought y = 100 was y = 0. (So the dotted line went through y = 0 and that's why I had claimed there was a mistake in the picture). – Huy Truong May 27 '21 at 23:34
  • Thanks for posting this answer. Actually, what I don't understand is why should we have the option of fit_intercept = False? Isn't it always better to fit the intercept? – Xin Niu Apr 21 '22 at 18:09