I wanted to implement a multiple regression model and wrote the following code:
import numpy as np
from sklearn.preprocessing import StandardScaler
class MatrixLinearRegression:
    def __init__(self):
        pass

    def fit(self, X, Y):
        X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
        tmp1 = np.linalg.inv(np.dot(X_.T, X_))
        tmp2 = np.dot(X_.T, Y)
        self.betas = np.dot(tmp1, tmp2)
        #print(np.dot(tmp1, tmp2))

    def score(self, X, Y):
        X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
        prediction = np.dot(X_, self.betas)
        Y_mean = np.mean(Y)
        ssr = np.sum((prediction - Y_mean)**2)
        ssto = np.sum((Y - Y_mean)**2)
        return ssr/ssto
# np.matrix is used here because the np.mat alias was removed in NumPy 2.0
X = np.array(np.matrix('70 1;69 1;60 1;69 1;70 1;69 1;70 1;83 0;70 0;75 0;74 0;90 0;87 0;86 0;85 0'))
Y = np.array(np.matrix('495;420;330;420;495;420;495;580;390;535;420;500;620;580;600'))
model = MatrixLinearRegression()
model.fit(X, Y)
print('Working example score: {:.2f}'.format(model.score(X, Y)))
np.random.seed(30)
X = np.random.randint(500, size=(10, 14))
Y = np.random.randint(500, size=(10,1))
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X) # Works neither with nor without scaling
model.fit(X, Y)
print('Not working example score: {:.2f}'.format(model.score(X, Y)))
I implemented two examples (see the code above). The first regression seems correct (I can reproduce it in R, and it matches my teacher's solutions). For the second example, however, the score is greater than 152, which cannot be right, since it should lie between 0 and 1.
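One thing that may be worth checking in the second example is the shape of the design matrix. With 10 observations and 14 features plus an intercept column, there are more unknowns than rows, so `X_.T @ X_` cannot have full rank and `np.linalg.inv` would not be reliable on it. The sketch below is a diagnostic under that assumption, not a confirmed fix. It regenerates data with the same shapes and uses `np.linalg.lstsq` as a swapped-in alternative to the explicit inverse; the variable names are only illustrative.

```python
import numpy as np

# Data with the same shapes as the second example: 10 observations, 14 features.
rng = np.random.RandomState(30)
X = rng.randint(500, size=(10, 14)).astype(float)
Y = rng.randint(500, size=(10, 1)).astype(float)

# Design matrix with an intercept column: 10 rows, 15 columns.
X_ = np.hstack([np.ones((X.shape[0], 1)), X])

# rank(X_.T @ X_) equals rank(X_), which is at most min(rows, cols) = 10.
# A 15 x 15 matrix of rank <= 10 is singular, so it has no true inverse.
rank = np.linalg.matrix_rank(X_)
print("design matrix shape:", X_.shape)
print("rank:", rank, "columns:", X_.shape[1])

# Assumption: a least-squares solver is acceptable instead of the explicit inverse.
# lstsq handles rank-deficient systems by returning a minimum-norm solution.
betas, *_ = np.linalg.lstsq(X_, Y, rcond=None)

# R^2 written as 1 - SS_res / SS_tot, which cannot exceed 1.
residuals = Y - X_ @ betas
r2 = 1 - np.sum(residuals**2) / np.sum((Y - Y.mean())**2)
print("R^2 with lstsq:", r2)
```

If the rank comes out smaller than the number of columns, the betas from the explicit inverse are likely dominated by rounding error, which might explain a score far outside the expected range.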
Right now, I am unable to find the error.
Can anyone point me in the right direction?
P.S. This is something of a cross-post. I was unable to get an answer on another platform, so I am trying again here. I hope this is OK. Otherwise, feel free to delete this question.
Update: I will try to add more context. My ultimate goal was to reproduce the LinearRegression class from the scikit-learn package. As a test dataset I used the extended Boston dataset from the book Introduction to Machine Learning with Python.
So the (not minimal) code is:
import mglearn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Linear Regression w/ sklearn
# Load dataset
X, Y = mglearn.datasets.load_extended_boston()
# Make train/test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
# Create model
model = LinearRegression()
model.fit(X_train, Y_train)
# Check accuracy
train_acc = model.score(X_train, Y_train)
test_acc = model.score(X_test, Y_test)
print('Train accuracy of Scikit Learn: {:.2f}\r\nTest accuracy of Scikit Learn: {:.2f}'.format(train_acc, test_acc))
# Linear Regression w/ Matrix fun
class MatrixLinearRegression:
    def __init__(self):
        pass

    def fit(self, X, Y):
        X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
        tmp1 = np.linalg.inv(np.dot(X_.T, X_))
        tmp2 = np.dot(X_.T, Y)
        self.betas = np.dot(tmp1, tmp2)

    def score(self, X, Y): #(by ely from stackoverflow)
        X_ = np.append(np.ones((X.shape[0], 1)), X, axis=1)
        prediction = np.dot(X_, self.betas)
        Y_mean = np.mean(Y)
        ssr = np.sum((prediction - Y)**2)
        ssto = np.sum((Y - Y_mean)**2)
        return 1 - ssr / ssto
model = MatrixLinearRegression()
model.fit(X_train, Y_train)
print('Train Accuracy of Own Implementation: {:.2f}'.format(model.score(X_train, Y_train)))
I noticed that the accuracy of my implementation differs from sklearn's. So I tried out different examples; some of them worked and some did not.
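To separate the cases that work from those that do not, a comparison on small, well-conditioned synthetic data may help. The sketch below is not the scikit-learn implementation. It only compares the explicit-inverse approach from my class with `np.linalg.lstsq` as a reference solver, and the helper names (`add_intercept`, `fit_normal_equations`, `fit_lstsq`, `r_squared`) and the synthetic data are hypothetical, introduced just for this check.

```python
import numpy as np

def add_intercept(X):
    # Prepend a column of ones so the first coefficient is the intercept.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_normal_equations(X, Y):
    # Same approach as MatrixLinearRegression.fit: betas = (X'X)^-1 X'Y.
    X_ = add_intercept(X)
    return np.linalg.inv(X_.T @ X_) @ (X_.T @ Y)

def fit_lstsq(X, Y):
    # Reference solution from a least-squares solver (no explicit inverse).
    betas, *_ = np.linalg.lstsq(add_intercept(X), Y, rcond=None)
    return betas

def r_squared(X, Y, betas):
    # R^2 = 1 - SS_res / SS_tot
    residuals = Y - add_intercept(X) @ betas
    return 1 - np.sum(residuals**2) / np.sum((Y - Y.mean())**2)

# Synthetic data: 50 observations, 3 independent features, small noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
true_coefs = np.array([[1.5], [-2.0], [0.5]])
Y = X @ true_coefs + 3.0 + rng.normal(scale=0.1, size=(50, 1))

b_inv = fit_normal_equations(X, Y)
b_lstsq = fit_lstsq(X, Y)

print("max coefficient difference:", np.max(np.abs(b_inv - b_lstsq)))
print("R^2 (inverse):", r_squared(X, Y, b_inv))
print("R^2 (lstsq):  ", r_squared(X, Y, b_lstsq))
```

When there are many more rows than columns and the columns are not collinear, the two solutions should agree closely. A large difference on another dataset would suggest that `X_.T @ X_` is singular or nearly singular there.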