I'm puzzled as to why I'm unable to arrive at the same values the model is predicting.

Consider the model below. I'm trying to understand the relationship between insurance charges, age, and whether or not a client is a smoker.

Notice the age variable has been pre-processed (mean-centered).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
model1 = smf.ols('charges~I(age - np.mean(age)) * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['I(age - np.mean(age))'], params['I(age - np.mean(age)):smoker[T.yes]']
x1 = (insurance['age'] - np.mean(insurance['age']))
# two lines with diff intercept and slopes
y_hat_non = b0 + b1 * x1 
y_hat_smok = (b0  + b2) + (b1 + b3) * x1

Now, when I generate new data and apply the predict method, I arrive at different values when I try to compute the predictions manually. Take for example index 0 and index 2: I would expect my manual computations to be close to the output below, but they are really far off.

Am I missing something regarding the feature transformation done when fitting the model?

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

idx_0 = (b0+b2) + (b1+b3) * 19
# 38061.1
idx_2 = b0 + b1 * 43
# 19878.4

fit1.predict(new_data)
0    27581.276650
1    10168.273779
2    10702.771604
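As a diagnostic, the design matrix that patsy builds from `new_data` (via `dmatrix`, which is what statsmodels uses under the hood when predicting from a formula) can be inspected to see what the centered column actually evaluates to on the new frame:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

new_data = pd.DataFrame({'age': [19, 41, 43],
                         'smoker': ['yes', 'no', 'no']})

# Build the same design matrix that predict() constructs for new_data.
# np.mean(age) inside I() is evaluated on new_data's ages,
# i.e. mean([19, 41, 43]) ≈ 34.33
X = dmatrix('I(age - np.mean(age)) * smoker',
            data=new_data, return_type='dataframe')
print(X['I(age - np.mean(age))'].values)
```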


Francisco
    background information https://patsy.readthedocs.io/en/latest/stateful-transforms.html – Josef May 09 '21 at 14:52

1 Answer

I suppose you want to center the age variable. `I(age - np.mean(age))` works at fit time, but when you try to predict, `np.mean(age)` is re-evaluated on your prediction data frame, so the centering uses the new data's mean rather than the training mean.

Also, when you multiply by the coefficients, you have to multiply by the centered value (i.e. age - mean(age)), not the raw age.
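As a side note, patsy also ships a stateful `center()` transform that stores the training mean in the design info and reuses it at predict time (see the stateful-transforms docs linked in the comments). A minimal sketch on synthetic data, assuming only that statsmodels is installed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
train = pd.DataFrame({'age': rng.integers(18, 65, 200).astype(float)})
train['charges'] = 100 * train['age'] + rng.normal(0, 50, size=200)

# center() is a patsy *stateful* transform: the training mean is
# remembered and reused when building the prediction design matrix
fit_centered = smf.ols('charges ~ center(age)', data=train).fit()
fit_raw = smf.ols('charges ~ age', data=train).fit()

new = pd.DataFrame({'age': [19.0, 43.0]})
# centering only shifts the intercept, so predictions agree
print(np.allclose(fit_centered.predict(new), fit_raw.predict(new)))  # True
```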

It doesn't hurt to create another column with the centered age:

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_std=False)

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance['age_c'] = sc.fit_transform(insurance[['age']])

model1 = smf.ols('charges~age_c * smoker', data=insurance)
fit1 = model1.fit()
params = fit1.params
# get params
b0, b2, b1, b3 = params['Intercept'], params['smoker[T.yes]'], params['age_c'], params['age_c:smoker[T.yes]']
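This works because the scaler is fitted once and keeps the training mean in its `mean_` attribute; `transform` then subtracts that stored value instead of recomputing a mean on whatever data it is given. A quick sketch on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[19.0], [41.0], [43.0], [25.0], [60.0]])
sc = StandardScaler(with_std=False).fit(ages)
print(sc.mean_)  # [37.6] -- the training mean, stored on the scaler

new_ages = np.array([[19.0], [43.0]])
# subtracts the stored training mean (37.6), not the mean of new_ages (31.0)
print(sc.transform(new_ages))  # [[-18.6], [5.4]]
```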

And you can predict, by using the scaler from before onto the age column:

new_data = pd.DataFrame({'age': {0: 19, 1: 41, 2: 43}, 
                        'smoker': {0: 'yes', 1: 'no', 2: 'no'}})

new_data['age_c'] = sc.transform(new_data[['age']])

new_data

   age smoker      age_c
0   19    yes -20.207025
1   41     no   1.792975
2   43     no   3.792975

Check:

idx_0 = (b0+b2) + (b1+b3) * -20.207025
# 26093.64269247414
idx_2 = b0 + b1 * 3.792975
# 9400.282805032146

fit1.predict(new_data)
0    26093.642567
1     8865.784870
2     9400.282695
StupidWolf