
I wanted to implement a multiple regression model and wrote the following code:

import numpy as np
from sklearn.preprocessing import StandardScaler

class MatrixLinearRegression:
    def __init__(self):
        pass

    def fit(self, X, Y):
        X_ = np.append(np.ones((X.shape[0],1)), X, axis = 1)
        tmp1 = np.linalg.inv(np.dot(X_.T, X_))
        tmp2 = np.dot(X_.T,Y)
        self.betas = np.dot(tmp1, tmp2)
        #print(np.dot(tmp1, tmp2))

    def score(self, X, Y):
        X_ = np.append(np.ones((X.shape[0],1)), X, axis = 1)
        prediction = np.dot(X_, self.betas)
        Y_mean = np.mean(Y)
        ssr = np.sum((prediction - Y_mean)**2)
        ssto = np.sum((Y - Y_mean)**2)
        return ssr/ssto

X = np.array(np.mat('70 1;69 1;60 1;69 1;70 1;69 1;70 1;83 0;70 0;75 0;74 0;90 0;87 0;86 0;85 0'))
Y = np.array(np.mat('495;420;330;420;495;420;495;580;390;535;420;500;620;580;600'))

model = MatrixLinearRegression()
model.fit(X, Y)
print('Working example score: {:.2f}'.format(model.score(X, Y)))

np.random.seed(30)
X = np.random.randint(500, size=(10, 14))
Y = np.random.randint(500, size=(10,1))

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X) # Works neither with scaling nor without
model.fit(X, Y)
print('Not working example score: {:.2f}'.format(model.score(X, Y)))

I implemented two examples (see the code above). The first regression seems to be correct (I can reproduce it with R and with my teacher's solutions); however, for the second example the score is greater than 152, which seems unrealistic, since it should be between 0 and 1.
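For comparison, here is a quick sanity check (added for illustration, not part of my original code) that runs scikit-learn's built-in R² score on the same random data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the random data from above
np.random.seed(30)
X = np.random.randint(500, size=(10, 14))
Y = np.random.randint(500, size=(10, 1))

# sklearn's built-in R^2 on the same data, as a reference point
sk_model = LinearRegression().fit(X, Y)
print('sklearn score: {:.2f}'.format(sk_model.score(X, Y)))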

Right now, I am unable to find the error.

Can anyone point me in the right direction?

P.S. This is somewhat of a cross-post. However, on another platform I was unable to get an answer, so I am trying again here. I hope this is okay; otherwise, feel free to delete this question.

Update: I will try to add more context. My ultimate goal was to reproduce the LinearRegression class from the scikit-learn package. As a test dataset I used the extended Boston dataset from the book Introduction to Machine Learning with Python.

So the (not minimal) code is:

import mglearn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Linear Regression w/ sklearn

# Load dataset
X, Y = mglearn.datasets.load_extended_boston()

# Make train/test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Create model
model = LinearRegression()
model.fit(X_train, Y_train)

# Check accuracy
train_acc = model.score(X_train, Y_train)
test_acc = model.score(X_test, Y_test)
print('Train accuracy of Scikit Learn: {:.2f}\r\nTest accuracy of Scikit Learn: {:.2f}'.format(train_acc, test_acc))

# Linear Regression w/ Matrix fun
import numpy as np
from sklearn.preprocessing import StandardScaler

class MatrixLinearRegression:
    def __init__(self):
        pass

    def fit(self, X, Y):
        X_ = np.append(np.ones((X.shape[0],1)), X, axis = 1)
        tmp1 = np.linalg.inv(np.dot(X_.T, X_))
        tmp2 = np.dot(X_.T,Y)
        self.betas = np.dot(tmp1, tmp2)

    def score(self, X, Y): #(by ely from stackoverflow)
        X_ = np.append(np.ones((X.shape[0],1)), X, axis = 1)
        prediction = np.dot(X_, self.betas)
        Y_mean = np.mean(Y)
        ssr = np.sum((prediction - Y)**2)
        ssto = np.sum((Y - Y_mean)**2)
        return 1 - ssr / ssto

model = MatrixLinearRegression()
model.fit(X_train, Y_train)
print('Train Accuracy of Own Implementation: {:.2f}'.format(model.score(X_train, Y_train)))

I noticed that the accuracy of my implementation differs from sklearn's. So I tried out different examples; some of them worked, some did not.
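Out of curiosity, here is a small diagnostic (an illustrative sketch, not part of my original code) that checks the conditioning of the normal equations on this dataset:

import mglearn
import numpy as np
from sklearn.model_selection import train_test_split

X, Y = mglearn.datasets.load_extended_boston()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

X_ = np.append(np.ones((X_train.shape[0], 1)), X_train, axis=1)
# A very large condition number means inverting X_.T @ X_ amplifies
# rounding error, so the normal-equations solution can drift away from
# sklearn's least-squares-based fit
print('{:.2e}'.format(np.linalg.cond(np.dot(X_.T, X_))))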


1 Answer

Your custom formula for the coefficient of determination (the output of score) is off.

Using the notation of the Wikipedia article on the coefficient of determination, you are choosing a special formula, ss_reg / ss_tot. But this decomposition only holds under certain conditions, as explained in that article. Because your second example (the randomly generated data) is an underdetermined system (fewer observations than covariates), those conditions do not apply: you have fewer degrees of freedom in the residual-based variance calculations (which implicitly divide by n - 1, where n is the number of observations) than you have covariates.
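To make this concrete, here is a minimal sketch (my addition) showing that the decomposition SSTO = SSE + SSR, which is what justifies ssr/ssto as an R² in the first place, breaks down for the underdetermined random example:

import numpy as np

np.random.seed(30)
X = np.random.randint(500, size=(10, 14)).astype(float)
Y = np.random.randint(500, size=(10, 1)).astype(float)

X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
# X_.T @ X_ is 15x15 but has rank at most 10, so it is singular, and
# np.linalg.inv typically returns numerical garbage (or raises an error)
# instead of the true OLS solution
print(np.linalg.matrix_rank(np.dot(X_.T, X_)))  # 10, not 15

betas = np.dot(np.linalg.inv(np.dot(X_.T, X_)), np.dot(X_.T, Y))
pred = np.dot(X_, betas)
sse = np.sum((pred - Y) ** 2)
ssr = np.sum((pred - np.mean(Y)) ** 2)
ssto = np.sum((Y - np.mean(Y)) ** 2)
print(sse + ssr, ssto)  # the two sides no longer match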

In this case, you need to use the explicit definition of r-squared instead of the special decomposition formula: return 1 - x, where x is your existing ratio, and define the ssr term as np.sum((prediction - Y) ** 2) (the sum of squared residuals) rather than using Y_mean in that term.

Note that if you simply increase the number of observations in the randomly generated data, the problem goes away. But if you want a definition of r-squared for the underdetermined data, you must use the explicit definition rather than the decomposition formula.
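For completeness, here is a sketch (not from the question) of a numerically stable fit for the underdetermined case, using NumPy's minimum-norm least-squares solver; because there are more parameters than observations, the fit interpolates the training data, and the explicit r-squared comes out as 1:

import numpy as np

np.random.seed(30)
X = np.random.randint(500, size=(10, 14)).astype(float)
Y = np.random.randint(500, size=(10, 1)).astype(float)
X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)

# Minimum-norm solution; well-defined even though X_.T @ X_ is singular
betas, _, _, _ = np.linalg.lstsq(X_, Y, rcond=None)
pred = np.dot(X_, betas)
r2 = 1 - np.sum((pred - Y) ** 2) / np.sum((Y - np.mean(Y)) ** 2)
print(r2)  # ~1.0: the training targets are reproduced exactly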

See the documentation for the score method of the standard LinearRegression, which uses this same definition.

def score(self, X, Y):
    # Prepend the intercept column, exactly as in fit
    X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
    prediction = np.dot(X_, self.betas)
    Y_mean = np.mean(Y)
    # Sum of squared residuals: prediction vs. actual, not vs. the mean
    ssr = np.sum((prediction - Y)**2)
    # Total sum of squares
    ssto = np.sum((Y - Y_mean)**2)
    return 1 - ssr / ssto
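As a quick check (my addition), this explicit definition matches sklearn.metrics.r2_score:

import numpy as np
from sklearn.metrics import r2_score

Y = np.array([3.0, -0.5, 2.0, 7.0])
pred = np.array([2.5, 0.0, 2.0, 8.0])
ssr = np.sum((pred - Y) ** 2)
ssto = np.sum((Y - np.mean(Y)) ** 2)
print(1 - ssr / ssto, r2_score(Y, pred))  # both ~0.9486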

There is also a warning:

/home/ely/anaconda/envs/py36-keras/lib/python3.6/site-packages/sklearn/utils/validation.py:444: 
DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)

If you are happy to treat the first covariate column as a continuous random variable, this is not a problem. However, the second covariate column is clearly a binary categorical variable, so it doesn't make sense to scale and re-center it, and doing so can actually hurt the regression output. Better options would be to rescale only the first column, or to use a regression algorithm that doesn't need rescaled inputs, such as a regression tree.
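Here is a minimal sketch of the first option, reusing the layout of the question's data (illustrative only): standardize the continuous first column and leave the 0/1 indicator untouched:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array(np.mat('70 1;69 1;60 1;69 1;70 1;69 1;70 1;'
                    '83 0;70 0;75 0;74 0;90 0;87 0;86 0;85 0'), dtype=float)

scaler = StandardScaler()
# Fit/transform only column 0; the [0] keeps the 2-D shape sklearn expects
X[:, [0]] = scaler.fit_transform(X[:, [0]])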

  • Okay, although your code works, I am a bit unsure why. I implemented the formula from [here](https://onlinecourses.science.psu.edu/stat501/node/255). Your code now uses SSE instead of SSR, I guess. – Franz Mar 11 '18 at 18:02
  • @Franz I also just realized that your second, random example suffers from under-determination (14+1 coefficients, but only 10 observations). I added a bit. – ely Mar 11 '18 at 18:02
  • Thanks for your effort, ely! :) I just noticed I cannot reproduce your not-working example score. In my case it is negative and not 0.39... – Franz Mar 11 '18 at 18:07
  • @Franz in that case, it sounds like you may have some additional errors in the code that is not shown here, since the numbers I have come directly from the code you have put in the question. If you can update the question, I will continue later on to help resolve it. But please take note of the extra issue I mentioned regarding the under-determined system. – ely Mar 11 '18 at 18:09
  • Okay, this is weird. I created a new document with the code I posted here on stackoverflow and changed your score method, and my score is still negative (-5.45)... I will try to investigate further; however, this still feels weird... – Franz Mar 11 '18 at 18:14
  • @Franz r-squared values are allowed to be negative, that is expected. The max value is 1, but they can be negative because models can be arbitrarily misspecified. The weird part is that you saw a value larger than 1, but this also makes sense **because your random example is underdetermined** meaning you have more free variables than you have degrees of freedom, so you can "explain too much variance" so to speak. – ely Mar 11 '18 at 18:20
  • I updated my initial post with my initial ideas and another example. Maybe this can help to tackle the problem. In the updated case you can see that my model's accuracy differs a lot from sklearn's. – Franz Mar 11 '18 at 18:25
  • Some [further details](https://stackoverflow.com/a/23726705/567620) about why your computed solution may differ for underdetermined systems. – ely Mar 11 '18 at 18:40
  • Great, that's totally it. I did not know that the extended Boston example is underdetermined as well. Thank you very much, ely. :) – Franz Mar 11 '18 at 19:40