I'm puzzled as to why I can't reproduce the values the model predicts.
Consider the model below. I'm trying to understand the relationship between insurance charges, age, and whether or not a client is a smoker.
Note that the age variable has been pre-processed (mean-centered).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker', data=insurance)
fit1 = model1.fit()
# extract the fitted coefficients
params = fit1.params
b0 = params['Intercept']
b1 = params['I(age - np.mean(age))']
b2 = params['smoker[T.yes]']
b3 = params['I(age - np.mean(age)):smoker[T.yes]']
x1 = (insurance['age'] - np.mean(insurance['age']))
# two regression lines with different intercepts and slopes (non-smokers vs. smokers)
y_hat_non = b0 + b1 * x1
y_hat_smok = (b0 + b2) + (b1 + b3) * x1
Now, when I generate new data and apply the predict method, I get different values from the ones I compute manually. Take index 0 and index 2, for example: I would expect the manual values to be close to the predict output below, but they are far off.
Am I missing something regarding the feature transformation done when fitting the model?
new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43},
'smoker': {0: 'yes', 1: 'no', 2: 'no'}})
idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4
fit1.predict(new_data)
0 27581.276650
1 10168.273779
2 10702.771604
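
A quick arithmetic check on the numbers above may help narrow this down. It is a sketch rather than a definitive diagnosis: it uses only values quoted in this post, and the variable names are hypothetical. It assumes that rows 1 and 2 (both non-smokers, two years apart) differ by exactly 2 * b1, and that predict evaluates b0 + b1 * (age - c) for some centering constant c.

```python
# Numbers copied from the post; variable names are hypothetical.
pred_age41_non = 10168.273779   # fit1.predict, row 1 (age 41, non-smoker)
pred_age43_non = 10702.771604   # fit1.predict, row 2 (age 43, non-smoker)
manual_age43_non = 19878.4      # b0 + b1 * 43, computed on the raw age

# Rows 1 and 2 are both non-smokers two years apart,
# so their predictions should differ by 2 * b1.
b1_implied = (pred_age43_non - pred_age41_non) / 2

# If predict used b0 + b1 * (43 - c), the manual value computed on the
# raw age would exceed the prediction by b1 * c. Solve for c.
implied_offset = (manual_age43_non - pred_age43_non) / b1_implied

# Mean age of the three rows in new_data.
new_data_mean_age = (19 + 41 + 43) / 3

print(round(b1_implied, 4))         # 267.2489
print(round(implied_offset, 2))     # 34.33
print(round(new_data_mean_age, 2))  # 34.33
```

The implied offset matches the mean age of new_data, which would be consistent with np.mean(age) in the formula being re-evaluated on whatever frame is passed to predict instead of reusing the training mean. If that is what is happening, the manual formulas would need the age centered in the same way before multiplying by the slopes. Patsy's stateful center(age) transform might be worth checking if the training mean is the intended reference.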