As pointed out in the comments by @AlexK, you need to add the intercept (or constant) to your test data. In your function, you had this step:
X = sm.add_constant(X)
The result of that step is what the model is fitted on, so at prediction time it expects 4 columns (the constant plus your 3 predictors) instead of 3.
Using an example:
import pandas as pd
import numpy as np
import statsmodels.api as sm
X_train = pd.DataFrame(
    np.random.normal(0, 1, (604, 41)),
    columns=["v" + str(i) for i in range(41)]
)
X_test = pd.DataFrame(
    np.random.normal(0, 1, (95, 41)),
    columns=["v" + str(i) for i in range(41)]
)
y_train = np.random.normal(0,1,(604,))
y_test = np.random.normal(0,1,(95,))
Fit and predict:
def fit_linear_regression(X, y):
    # Add the intercept column; the fitted model then has one extra parameter
    X = sm.add_constant(X)
    est = sm.OLS(y, X)
    est = est.fit()
    return est
model = fit_linear_regression(X_train.iloc[:, [0, 1, 2]], y_train)
model.predict(sm.add_constant(X_test.iloc[:, [0, 1, 2]]))
Since you are using a DataFrame, which presumably has proper column names, you can also consider the formula interface (see the help page). It adds the intercept for you, both when fitting and when predicting, so the only tweak needed is to build the formula from all the columns in your input (see this post too):
import statsmodels.formula.api as smf
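def formula_linear_regression(X, y):
    # Build e.g. "y ~ v0+v1+v2" from the column names
    formula = "y ~ " + "+".join(X.columns)
    # Work on a copy so the caller's DataFrame is not modified
    df = X.copy()
    df['y'] = y
    # Pass df, which holds both the predictors and the response
    est = smf.ols(formula=formula, data=df)
    est = est.fit()
    return est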
model2 = formula_linear_regression(X_train.iloc[:, [0, 1, 2]], y_train)
# No add_constant here: the formula interface adds the intercept itself
model2.predict(X_test.iloc[:, [0, 1, 2]])