6

In scikit-learn's PolynomialFeatures preprocessor, there is an include_bias option. This essentially just adds a column of ones to the output. I was wondering what the point of having this is. Of course, you can set it to False. But theoretically, how does having or not having a column of ones along with the generated polynomial features affect regression?

This is the explanation in the documentation, but I can't seem to get anything useful out of it in relation to why it should be used or not.

include_bias : boolean

If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

micharaze

1 Answer

19

Suppose you want to perform the following regression:

y ~ a + b x + c x^2

where x is a generic sample. The best coefficients a, b, c are computed via simple matrix calculus. First, let us denote by X = [1 | x | x^2] a matrix with N rows, where N is the number of samples: the first column is a column of 1s, the second is the column of values x_i for all samples i, and the third is the column of values x_i^2 for all samples i. Let us denote by B the column vector B = [a b c]^T. If Y is the column vector of the N target values, we can write the regression as

Y ~ X B

The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.

The goal of training the regression is to find B = [a b c]^T such that X B is as close as possible to Y.

If you don't add the column of 1s, you are assuming a priori that a = 0, which might not be correct.
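As a quick numerical check (a minimal NumPy sketch with made-up data, not part of the original answer), here is what happens when the true intercept is nonzero but the column of 1s is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 5 + 2 * x + x**2  # true coefficients: a = 5, b = 2, c = 1

# Design matrix with the bias column: X = [1 | x | x^2]
X_with_bias = np.column_stack([np.ones_like(x), x, x**2])
coef_with, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

# Design matrix without the bias column: implicitly forces a = 0
X_no_bias = np.column_stack([x, x**2])
coef_without, *_ = np.linalg.lstsq(X_no_bias, y, rcond=None)

print(coef_with)     # ≈ [5, 2, 1]: the true intercept is recovered
print(coef_without)  # b and c are distorted to compensate for the missing a
```

With the bias column the fit is exact; without it, the remaining coefficients absorb part of the constant term and come out wrong.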

In practice, when you write Python code and use PolynomialFeatures together with sklearn.linear_model.LinearRegression, the latter takes care of adding the column of 1s by default (since in LinearRegression the fit_intercept parameter is True by default), so you don't need to add it in PolynomialFeatures as well. Therefore, with PolynomialFeatures one usually keeps include_bias=False.
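A short sketch of that usual combination (assuming scikit-learn is installed; the data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 5 + 2 * x.ravel() + x.ravel()**2  # true intercept 5, coefficients [2, 1]

# include_bias=False: no column of 1s, because LinearRegression
# (with its default fit_intercept=True) fits the intercept itself.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)  # columns: [x, x^2]

model = LinearRegression().fit(X_poly, y)
print(model.intercept_)  # ≈ 5
print(model.coef_)       # ≈ [2, 1]
```

If you instead set include_bias=True here, the column of 1s becomes a regular feature with its own coefficient while LinearRegression still fits a separate intercept, which is redundant.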

The situation is different if you use statsmodels.OLS instead of LinearRegression: statsmodels' OLS does not add an intercept automatically, so there the column of 1s has to be supplied explicitly.

Andrea Araldo
    Thanks, the last part of your answer was exactly what I was looking for. – Anup Sebastian Feb 18 '20 at 01:07
  • Just like Anup, I liked how you included the last two paragraphs to clear up any confusion. – Apie Oct 28 '20 at 04:42
  • So here's a follow-up. I am trying to fit a Lasso model. It internally has a fit_intercept option set to True, but PolynomialFeatures also gives the option include_bias. Using either way to include the bias/intercept term, I am getting different values for the intercept. If both do the same thing, why does the intercept come out different? – Amit Amola Oct 10 '22 at 12:34