def fit_linear_regression(X, y):
    X = sm.add_constant(X)
    est = sm.OLS(y, X)
    est = est.fit()
    return est

print(X_train.shape)  # outputs (604, 41)
print(X_test.shape)  # outputs (95, 41)

model = fit_linear_regression(X_train.iloc[:, [0, 1, 2]], y_train)

model.predict(X_test.iloc[:, [0, 1, 2]])

When I run this script, I get the following error:

ValueError: shapes (95,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

When I pass the whole dataframes instead of selecting columns, I get the same error, with shapes (95,41) and (42,) not aligned. What is going on here?

X_train, y_train and y_test are pandas DataFrames.

RogerKint
  • The problem is that you are not adding a constant to your `X_test` data before passing it to the `predict()` function. See [this](https://www.statsmodels.org/dev/examples/notebooks/generated/predict.html) example in the documentation showing how to do it properly. – AlexK Aug 02 '22 at 19:05

1 Answer


As pointed out in the comments by @AlexK, you need to add the intercept (or constant) to your test data. In your function, you had this step:

X = sm.add_constant(X)

The model is therefore fit on 4 columns (the constant plus your 3 predictors), so `predict()` expects 4 columns as well, while you are passing it only 3.

Using an example:

import pandas as pd
import numpy as np
import statsmodels.api as sm

X_train = pd.DataFrame(
    np.random.normal(0,1,(604,41)),
    columns = ["v" + str(i) for i in range(41)]
    )

X_test = pd.DataFrame(
    np.random.normal(0,1,(95,41)),
    columns = ["v" + str(i) for i in range(41)]
)

y_train = np.random.normal(0,1,(604,))
y_test = np.random.normal(0,1,(95,))

Fit and predict:

def fit_linear_regression(X, y):
    X = sm.add_constant(X)
    est = sm.OLS(y, X)
    est = est.fit()
    return est

model = fit_linear_regression(X_train.iloc[:, [0, 1, 2]], y_train)

model.predict(sm.add_constant(X_test.iloc[:, [0, 1, 2]]))

Since you are using a dataframe, and assuming it has proper column names, you can also consider the formula interface (see the help page), which adds the intercept for you both when fitting and when predicting. It only needs a small tweak to build a formula that includes all the columns in your input (see this post too):

import statsmodels.formula.api as smf

def formula_linear_regression(X, y):
    formula = "y ~ " + "+".join(X.columns)
    # the formula looks up y in the same dataframe as the predictors
    df = X.copy()
    df['y'] = y
    est = smf.ols(formula=formula, data=df)
    est = est.fit()
    return est

model2 = formula_linear_regression(X_train.iloc[:, [0, 1, 2]], y_train)

# no add_constant needed here, the formula handles the intercept
model2.predict(X_test.iloc[:, [0, 1, 2]])
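
As a quick check of the formula approach, here is a self-contained sketch on made-up data (sizes, seed, and the name `formula_model` are illustrative assumptions). It fits with `smf.ols` and predicts on a test frame that has no constant column:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data only
rng = np.random.default_rng(0)
cols = ["v0", "v1", "v2"]
X_train = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)
X_test = pd.DataFrame(rng.normal(size=(10, 3)), columns=cols)

# The response must live in the same dataframe as the predictors
df = X_train.copy()
df["y"] = rng.normal(size=50)

formula_model = smf.ols("y ~ " + " + ".join(cols), data=df).fit()

# The intercept is part of the formula, so the raw test frame is enough
preds = formula_model.predict(X_test)

print(list(formula_model.params.index))  # ['Intercept', 'v0', 'v1', 'v2']
print(preds.shape)                       # (10,)
```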
StupidWolf