
I have been trying to code a gradient descent algorithm from scratch for multi-feature linear regression, but when I predict on my own training dataset I get results that are too accurate.

class gradientdescent:

    def fit(self, X, Y):
        lr = 0.005                  # learning rate
        b = 0
        M = np.ones(X.shape[1])     # initial weights
        n = np.size(X, 0)
        for i in range(10000):
            sum = 0
            sum1 = 0
            for j in range(n):
                sum = sum + (np.dot(X[j], M) + b - Y[j]) * X[j]
                sum1 = sum1 + (np.dot(X[j], M) + b - Y[j])
            m_gradient = lr * sum / n
            b_gradient = lr * sum1 / n
            M = M - m_gradient
            b = b - b_gradient
        self.b = b
        self.M = M
        self.n = n

The dataset below is arbitrary; I entered random values into the X and Y arrays.

X = np.array([[1,2,3,4,5],[2,1,4,3,5],[1,3,2,5,4],[3,0,1,2,4],[0,1,2,4,3]])
Y = np.array([5,6,2,8,100])

My prediction function:

def predict(self, X):
    for i in range(self.n):
        print(np.dot(self.M, X[i]) + self.b)

The predicted values:

5.000000000080892
5.999999999956618
1.9999999999655422
8.000000000004814
99.99999999998795

There is no way the fitted model should pass through the training data this closely, since the data was random, so I expected at least some error. I even tried changing the data, but it still gives these near-exact results.

Please tell me if there is a problem with my algorithm.

  • You mention "the plotted graph", but there is no graph shown, nor the plotting code (which may also be part of the problem). Also, your `predict` function doesn't return anything: it just prints values. – 9769953 Aug 27 '22 at 21:00
  • It seems that your model has 5 unknown coefficients, so the system of 5 equations in 5 unknowns has an exact solution. That the accuracy is only ~10 decimals might be due to the value 100, which makes the problem somewhat ill-conditioned. – Yves Daoust Aug 27 '22 at 21:00
  • @YvesDaoust I tried changing the number of features; with 4 features I observed a small error, but when I increase the number of iterations from 100 to 1000 and then to 10000 the error keeps decreasing. My question is: shouldn't my predictions stop changing after a certain number of iterations? – Srivaths Gondi Aug 28 '22 at 11:38
  • @SrivathsGondi: need to see the figures. – Yves Daoust Aug 28 '22 at 12:21

1 Answer


Your X matrix is invertible, so the exact linear solution is simply

w = X^-1 y

or in numpy

w = np.linalg.inv(X).dot(Y)
# array([-13.98181818, -59.8       ,   2.74545455,  47.56363636,
#        -11.98181818])

and then you get a perfect prediction

X.dot(w)
# array([  5.,   6.,   2.,   8., 100.])

This is happening because almost every "random" square matrix is invertible: the set of non-invertible (singular) matrices is vanishingly small, so you would have to construct one deliberately. Error is to be expected, by contrast, when you have more data points than features; then the system is overdetermined and linear regression generally cannot fit it perfectly.

lejlot