
In a book I found the following code, which fits a LinearRegression to quadratic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
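Running a reproducible variant of that snippet (the fixed seed and the printout are additions, not from the book) shows the fit does recover coefficients close to the true values 2, 1 and 0.5:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)  # fixed seed so the result is reproducible
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lin_reg = LinearRegression().fit(X_poly, y)

# intercept_ should land near 2, coef_ near [1, 0.5]
print(lin_reg.intercept_, lin_reg.coef_)
```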


But how can that be? I know from the documentation that PolynomialFeatures(degree=2, include_bias=False) creates an array which looks like:

[[X[0],X[0]**2]
[X[1],X[1]**2]
.....
[X[n],X[n]**2]]
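That documented behaviour is easy to check directly on a tiny input: for a single feature, the transform just appends the square of each value as a second column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly)  # [[2. 4.]
               #  [3. 9.]]
```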

BUT: How is the LinearRegression able to fit this data? In other words, WHAT is the LinearRegression doing, and what is the concept behind this?

I am grateful for any explanations!

2Obe

1 Answer


PolynomialFeatures with degree two (and with the default include_bias=True) will create an array that looks like:

   [[1, X[0], X[0]**2]
    [1, X[1], X[1]**2]
    .....
    [1, X[n], X[n]**2]]

With include_bias=False, as in your code, the column of ones is dropped; LinearRegression then fits the intercept term itself (fit_intercept=True by default), which comes to the same thing.

Let's call the matrix above X. Then LinearRegression looks for 3 numbers a, b, c such that the vector

X * [[a],[b],[c]] - Y

has the smallest possible mean squared error (which is just the mean of the squares of the entries of the vector above).

Note that the product X * [[a],[b],[c]] is just the product of the matrix X with the column vector [a,b,c].T. The result is a vector of the same dimension as Y.
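The same least-squares problem can be sketched directly with NumPy (the data here is made up for illustration; np.linalg.lstsq finds exactly the a, b, c that minimize that squared error):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
Y = 0.5 * x**2 + x + 2 + rng.normal(size=50)

# Build the matrix [[1, x, x**2], ...] described above
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares: the (a, b, c) minimizing ||X @ [a, b, c] - Y||**2
(a, b, c), *_ = np.linalg.lstsq(X, Y, rcond=None)

residual = X @ np.array([a, b, c]) - Y
print(a, b, c)               # close to 2, 1, 0.5
print(np.mean(residual**2))  # the mean squared error being minimized
```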

Regarding the questions in your comment:

  1. This function is linear in the new set of features: x, x**2. Just think about x**2 as an additional feature in your model.

  2. For the particular array mentioned in your question, the LinearRegression method is looking for numbers a,b,c that minimize the sum

    (a*1 + b*X[0] + c*X[0]**2 - Y[0])**2 + (a*1 + b*X[1] + c*X[1]**2 - Y[1])**2 + ... + (a*1 + b*X[n] + c*X[n]**2 - Y[n])**2

So it will find a set of such numbers a,b,c. Hence the suggested function y=a+b*x+c*x**2 is not based only on the first row. Instead, it is based on all the rows, because the parameters a,b,c that are chosen are those that minimize the sum above, and this sum involves elements from all the rows.

  3. Once you have created the vector x**2, the linear regression just regards it as an additional feature. You can give it a new name, v = x**2. Then the linear regression has the form y = a + b*x + c*v, which means it is linear in x and v. The algorithm does not care how you created v; it just treats v as an additional feature.
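That renaming can be made concrete: fitting LinearRegression on the two ordinary columns x and v = x**2 (simulated data here, matching the question's setup) produces the same kind of model as the polynomial pipeline, because the two feature matrices are identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x**2 + x + 2 + rng.normal(size=(100, 1))

v = x**2                      # a new name for the squared feature
features = np.hstack([x, v])  # two ordinary features: x and v

reg = LinearRegression().fit(features, y)
# y = a + b*x + c*v: linear in x and v, but a parabola as a function of x
print(reg.intercept_, reg.coef_)
```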
Miriam Farber
  • Ok thanks. Now let's say the LinearRegression function has found the optimal parameters a=1, b=2 and c=3; then the function for the first row becomes y = 3x**2 + 2x + 1. And now? 1. What is the LinearRegression doing, because this function is not linear... 2. Further, if the LinearRegression is doing this for each row in the array, is it right that in an n*m array, n linear regressions are computed? And 3. I still don't get how a linear regression can have a curved shape? – 2Obe Jul 13 '17 at 22:27
  • Additional feature means an additional axis, right? So the LinearRegression curve in a two-dimensional coordinate system can look like a curve, but it is actually still a straight line, just in a higher-dimensional space? – 2Obe Jul 13 '17 at 22:50
    @2Obe yes exactly. – Miriam Farber Jul 13 '17 at 22:55