
There seem to be two methods for OLS fits in Python: the sklearn one and the statsmodels one. I prefer the statsmodels one because it gives the error on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from sklearn to log-transform my target. It would seem that I need to choose between getting the error on my fit coefficients in statsmodels and being able to transform my target in sklearn. Is there a good way to do both of these at the same time in either system?

In statsmodels it would be done like this:

import statsmodels.api as sm
X = sm.add_constant(X)
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())

This returns the fit summary with the coefficients and the errors on them.

For sklearn you can use the TransformedTargetRegressor:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
regr = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print('Coefficients: \n', regr.regressor_.coef_)

But there is no way to get the error on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?

EDIT

I found a good example for the special case I care about here

https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm
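
For reference, here is a minimal sketch of what that article describes, assuming a model fit on a plain np.log(y) rather than np.log1p (the variable names are mine, and the multiplicative interpretation below only holds exactly for a plain log transform):

import numpy as np
import statsmodels.api as sm

# assumes a feature matrix X and a strictly positive target y
X_const = sm.add_constant(X)
log_fit = sm.OLS(np.log(y), X_const).fit()

# exp(coef) is the multiplicative change in the geometric mean of y
# for a one-unit increase in the corresponding feature
print(np.exp(log_fit.params))
# exponentiating the confidence limits gives multiplicative CIs
print(np.exp(log_fit.conf_int()))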

Keith
  • https://stackoverflow.com/help/minimal-reproducible-example – Paul H Aug 03 '21 at 22:27
  • @PaulH I did what I could to help illustrate the issue better. It is a functionality issue so I am not exactly sure how to illustrate it best. – Keith Aug 03 '21 at 23:51
  • What functionality do you get from `TransformedTargetRegressor` that you would not get from transforming the outcome yourself? – coffeinjunky Aug 05 '21 at 23:05
  • @coffeinjunky ols_result.summary() gives the results in the transformed target units if I transform manually. This is how I get the error on the coefficients. I need them in the non-transformed units. If there was a way to transform that, or even just the coefficients and their error, then that would be a solution worth the bounty. – Keith Aug 05 '21 at 23:39
  • Just to clarify, you are interested in the coefficients and their standard error as they relate to the untransformed outcome after fitting the regression on the transformed outcome. You say that you would normally get this from `TransformedRegressor` with the exception of the standard errors. Is this correct so far? If so, just for clarity's sake, could you add the exact commands you would use to obtain your desired coefficients from `TransformedTargetRegressor`? – coffeinjunky Aug 06 '21 at 12:03
  • Also, do you want to use this for statistical inference on the parameters, for interpretation of the coefficients, or for prediction purposes? – coffeinjunky Aug 06 '21 at 12:07

2 Answers


Just to add a lengthy comment here, I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.

Example:
import pandas as pd
import statsmodels.api as sm

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets

Create some sample data:

df = pd.DataFrame(datasets.load_iris().data)
df.columns = datasets.load_iris().feature_names

X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']

Sklearn first:

regr = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)

print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
    print(coef)
#-0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887

Statsmodels on transformed outcome:

X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()
print(ols_trans.params)

#const               -0.458678
#sepal length (cm)    0.356758
#sepal width (cm)    -0.296294
#dtype: float64

You see that in both cases the coefficients are identical. That is, using the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS with the transformed outcome. TransformedTargetRegressor does not backtransform the coefficients into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). This discussion here points in a similar direction - backtransforming betas is infeasible in most/many cases.
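
To see this concretely, the inverse transformation only shows up in predict(). Continuing the example above (re-creating the feature columns, since X was overwritten by add_constant):

X_feats = df.loc[:, ['sepal length (cm)', 'sepal width (cm)']]

# the inner regressor predicts in the transformed (log1p) space
log_space_pred = regr.regressor_.predict(X_feats)

# TransformedTargetRegressor.predict applies inverse_func (np.expm1) on top
print(np.allclose(regr.predict(X_feats), np.expm1(log_space_pred)))
# True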

What to do instead?

If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values where you vary the regressors or the coefficients. So, let me give you an example: if your goal is to state the effect on the untransformed outcome of a sepal length coefficient that is one standard error higher, you can create the predicted values as fitted, as well as the predicted values for the 1-sigma scenario (either by tampering with the coefficient, or by tampering with the corresponding column in X).

Example:
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] +=  0.018 # this is one sigma


# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
    return inv_func(data.dot(coeffs))

# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)

Then you can report, for example, the mean shift in the untransformed outcome:

variant1.mean()-original.mean()
#0.2523083548367202
coffeinjunky
  • I am doing a difference-in-differences analysis. I have three binary input features. I want to understand if they have a positive or negative effect on the untransformed target. This article led me to believe such a transform was possible: https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm The target is in dollars, so it would be nice to understand how the treatment of the experiment (the third feature) changes the revenue in dollars. Is this possible? – Keith Aug 06 '21 at 16:59
  • For this case, yes, you can do this by exploiting the property that `exp(sum of x_i)` is the product of the `exp(x_i)`s. Essentially, `exp(ln(y)) = exp(alpha + beta_1*x1 + beta_2*x2) = exp(alpha) * exp(beta_1*x1) * exp(beta_2*x2)`. – coffeinjunky Aug 06 '21 at 17:21
  • With continuous variables, you would need to choose appropriate values for x1 and x2 (see Stata's marginal effects options for GLMs/LDV models). In your case, this is easy as you have a dummy, so it will only be `exp(beta_2)`. You should get the same effect from comparing the predictions with the dummy set to one for all and the dummy set to zero for all (see the sketch after this comment thread). In either case, I don't think that `TransformedTargetRegressor` does any of this for you unless I overlooked something. @Keith – coffeinjunky Aug 06 '21 at 17:34
  • OK great. So I can just exp() my coeff. What about the STD on those coeff? Is it reasonable to just apply the exp()? – Keith Aug 06 '21 at 17:38
  • Or would it be better to apply the exp() function (specifically np.expm1) to the CL limits which I can get via conf_int() in statsmodel? – Keith Aug 06 '21 at 17:50
  • Just to be really super clear and to make sure we are barking up the right tree here: if you look at that formula again, you see that this is all multiplicative and not additive. And if you read the Stata article really carefully, they talk about the geometric mean instead of the arithmetic mean. Are you certain that this is what you want in the end? If not, I would still go down the route of using the predicted values and simply varying the values of the regressor. As to the standard errors or CI, I am almost certain that this is not quite as straightforward. – coffeinjunky Aug 06 '21 at 18:15
  • Can you describe how exactly you want to use the standard errors? Is this for a paper? The inference in the transformed model is valid, so the statistical tests are fine as they are. @Keith – coffeinjunky Aug 06 '21 at 18:16
  • This is not for a paper. I work in business and this is to make a decision about our business model. After a bunch of manipulation to control for several things we are left with splits Control/Test and before/after. This is ideal for a difference-in-differences analysis. There are three binary features. The first two model Control/Test and before/after. The third is for Test & after, when the treatment is really applied. After the fit, the 3rd coefficient will tell me the net effect of the treatment. If it is statistically significantly positive we made more money. – Keith Aug 06 '21 at 19:03
  • This is all straightforward until I use np.log1p() to get normal data and a better fit. I no longer know how to interpret the coefficient on the 3rd feature, so I do not know whether I can tell the business that the treatment is something we should keep doing. – Keith Aug 06 '21 at 19:06
  • Okay. So, if your coefficient is statistically significant (using appropriate standard errors), you are done with the inference. The remaining question is whether it is economically significant. Here, the exercise described above seems to give you an answer, doesn't it? Use the differences in the predicted values (in $) with the relevant dummies/interactions=1 vs those with interaction=0. Or even familiarise yourself with the geometric mean story and calculate exp(beta). It looks like you don't need the standard errors here. – coffeinjunky Aug 06 '21 at 19:25
  • I think that works. For statistical significance I do not think I need to transform back for interpretation, since np.log1p(y) is monotonic and goes through (0,0). If it is significantly different from 0 then its transform must be too. – Keith Aug 06 '21 at 19:41
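
For completeness, a minimal sketch of the prediction-comparison approach from the comments above, assuming a statsmodels fit ols_res on np.log1p(y) and a binary treatment column named 'treated' in a DataFrame X (both names are illustrative):

import numpy as np

# counterfactual design matrices: everyone treated vs. nobody treated
X_treated = X.copy()
X_treated['treated'] = 1
X_untreated = X.copy()
X_untreated['treated'] = 0

# backtransform the predictions into the original (dollar) units
pred_treated = np.expm1(ols_res.predict(X_treated))
pred_untreated = np.expm1(ols_res.predict(X_untreated))

# average effect of the treatment on the untransformed outcome
print(pred_treated.mean() - pred_untreated.mean())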

In short, scikit-learn cannot help you in calculating coefficient standard errors. However, if you opt to use it, you can calculate the errors yourself. In the question Python scikit learn Linear Model Parameter Standard Error, @grisaitis provided a great answer explaining the main concepts behind it.

If you only want a plug-and-play function that will work with scikit-learn, you can use this:

import numpy as np

def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
                        y_true: 'np.ndarray', X: 'np.ndarray'):
    """Function that calculates the standard errors of the coefficients of
    a linear regression.

    Parameters
    ----------
    reg : sklearn.linear_model.LinearRegression
        LinearRegression object which has been fitted
    y_true : np.ndarray
        array containing the target variable
    X : np.ndarray
        array containing the features used in the regression
        (include a column of ones if the model has an intercept)

    Returns
    -------
    beta_std
        Standard errors of the regression coefficients
    """
    y_pred = reg.predict(X)   # get predictions
    errors = y_true - y_pred  # calculate residuals
    # unbiased estimate of the residual variance: SSR / (n - p)
    sigma_sq_hat = (errors @ errors) / (X.shape[0] - X.shape[1])

    # covariance matrix of the coefficient estimates
    sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T @ X)

    return np.sqrt(np.diagonal(sigma_beta_hat))  # diagonal recovers the variances
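
A hypothetical usage sketch (toy data; the names are illustrative). Note that X includes an explicit constant column and the regression is fit with fit_intercept=False, so the intercept gets a standard error too:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data with an explicit intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

# the constant is already a column of X, so fit_intercept=False
reg = LinearRegression(fit_intercept=False).fit(X, y)
print(get_coef_std_errors(reg, y, X))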
henrique39
  • So your solution would be to use sklearn and write code for the errors, rather than use statsmodels and transform manually? That is the direction I was leaning. – Keith Aug 06 '21 at 14:52
  • Yes, but it is just because I am more familiar with the Scikit-learn API – henrique39 Aug 06 '21 at 15:03
  • This does not seem to do the same thing that @grisaitis's method does. I will use his – Keith Aug 06 '21 at 15:59