X has 1 features, but LinearRegression is expecting 10 features as input

Question

I've seen similar questions asked here, but they all seem to be caused by a different problem. I've tried reshaping and making sure it's a 2D array, but I keep getting this error. Here is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from io import StringIO
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
d = pd.read_csv("http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv")
X = d[['cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
y=d[['mpg']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test).reshape(-1,1)
y_test=y_test.reshape(-1,1)
model.score(pred,y_test)

I'd appreciate any help!

score 1 · Answer 1 · answered Apr 01 '23 at 02:13

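The arguments for model.score() should be X_test and y_test, not pred and y_test. score() calls predict() on its first argument internally, so it needs the same 10 feature columns the model was fitted on. pred has only one column, which is why the error reports 1 feature.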

From the docs:

Parameters
----------
X : array-like of shape (n_samples, n_features)
    Test samples. For some estimators this may be a precomputed
    kernel matrix or a list of generic objects instead with shape
    ``(n_samples, n_samples_fitted)``, where ``n_samples_fitted``
    is the number of samples used in the fitting for the estimator.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True values for `X`.
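To make the point above concrete, here is a minimal sketch that reproduces the failure and the fix. It uses randomly generated data with 10 feature columns as a stand-in for the mtcars columns (an assumption, so it runs without downloading the CSV); the variable names mirror the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the question's data: 32 rows, 10 features (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=32)

X_train, X_test = X[:16], X[16:]
y_train, y_test = y[:16], y[16:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test).reshape(-1, 1)  # shape (16, 1): a single column

# score() runs predict() on its first argument, so passing pred
# hands the model 1 feature instead of the 10 it was fitted on.
try:
    model.score(pred, y_test)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("Wrong call raised:", error_message)

# Correct call: pass the test features, not the predictions.
r2 = model.score(X_test, y_test)
print("R2 on the test split:", r2)
```

The exact wording of the error message may vary between scikit-learn versions, but the cause is the same: the first argument must have the same number of columns as the training data.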

score 0 · Answer 2 · answered Apr 01 '23 at 02:06

This should work. The only change is the last line, which passes X_test to model.score() instead of pred:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from io import StringIO
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
d = pd.read_csv("http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv")
X = d[['cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
y=d[['mpg']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test).reshape(-1,1)
y_test=y_test.reshape(-1,1)
model.score(X_test,y_test)

It would be more helpful if you can explain what goes wrong, and what does your code fix — Minh-Long Luu, Apr 01 '23 at 02:56

score 0 · Answer 3 · answered Apr 01 '23 at 06:23

If you want to evaluate the quality of your predictions (pred) against the true values (y_test), use a regression metric such as:

R2 score:

from sklearn.metrics import r2_score
print(r2_score(y_test, pred))
# Out: 0.0058351676336382274

Mean Absolute Error:

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test, pred))
# Out: 4.268248988378566

Mean Squared Error:

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, pred))
# Out: 30.227426389844176

Mean absolute percentage error:

from sklearn.metrics import mean_absolute_percentage_error
print(mean_absolute_percentage_error(y_test, pred))
# Out: 0.23155559024744402

The last metric is easy to interpret because it is a relative error: 0 means perfect predictions, and here the predictions are off by about 23% of the true values on average. Note that it is not bounded above by 1; it can exceed 1 when the errors are larger than the true values themselves.
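As a quick check of how this metric behaves, the sketch below computes the mean absolute percentage error by hand and with scikit-learn on a few made-up values (the arrays are illustrative, not taken from the mtcars data). It also shows that the value can go above 1 when predictions are far off.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative values only.
y_true = np.array([10.0, 20.0, 40.0])
y_good = np.array([11.0, 18.0, 44.0])    # each prediction is off by 10%
y_bad = np.array([30.0, 60.0, 120.0])    # each prediction is off by 200%

# MAPE is the mean of |error| / |true value|.
manual = np.mean(np.abs(y_true - y_good) / np.abs(y_true))

mape_good = mean_absolute_percentage_error(y_true, y_good)
mape_bad = mean_absolute_percentage_error(y_true, y_bad)

print(manual, mape_good)  # both 0.1 up to rounding, i.e. 10%
print(mape_bad)           # 2.0 up to rounding, i.e. 200%: larger than 1
```

The result is a fraction, not a percentage, so multiply by 100 if you want to report it as a percent.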

There are more metrics for evaluating the quality of your predictions. When you use model.score, the score is the coefficient of determination R2, which is the same as the r2_score metric:

>>> model.score(X_test, y_test)
0.0058351676336382274

>>> r2_score(y_test, pred)
0.0058351676336382274
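The equivalence can be checked on any data. The sketch below uses synthetic data (an assumption, so it runs standalone) and also computes R2 by hand as 1 - SS_res / SS_tot. Note the argument order: model.score takes (X, y), while r2_score takes (y_true, y_pred).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data; shapes and coefficients are arbitrary.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=40)

X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

score_from_model = model.score(X_test, y_test)  # features first, then targets
score_from_metric = r2_score(y_test, pred)      # true values first, then predictions

# Manual R2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(score_from_model, score_from_metric, manual_r2)
```

All three numbers should agree up to floating-point rounding.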
