
I am running a series of OLS regressions in Python, where I am using standard errors that I calculate using a custom function.

I am now exporting my regression results into tables, and plan on using the stargazer package (linked here). However, stargazer relies on regression results being calculated via the statsmodels package.

I am having trouble incorporating my custom standard errors into statsmodels, and hence cannot export using stargazer. I have looked for a way to overwrite the default standard errors in statsmodels, but have not been successful.

I've provided an example below:

import pandas as pd
from sklearn import datasets
import statsmodels.api as sm
from stargazer.stargazer import Stargazer

#load data
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data)
df.columns = ['Age', 'Sex', 'BMI', 'ABP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']
df['target'] = diabetes.target

#run regressions with statsmodels
est = sm.OLS(endog=df['target'], exog=sm.add_constant(df[df.columns[0:4]])).fit()

#custom standard errors function, returns a K-by-1 vector where K is the number of predictors
#I return a vector of ones here for simplicity
def custom_standard_errors(endog, exog):
    return [1 for i in range(len(exog.columns))]
    
#export regression table with stargazer
stargazer = Stargazer([est])

The stargazer object is displayed below. My goal is to overwrite the standard errors in the parentheses with the output from custom_standard_errors(), so in this example every value in parentheses should be 1.
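To make the expected output concrete: since exog above is sm.add_constant(df[df.columns[0:4]]), it has five columns (const plus Age, Sex, BMI, ABP), so the stub returns five ones. A minimal standalone check (using an empty DataFrame as a stand-in for the real exog):

```python
import pandas as pd

# Stand-in for the exog passed to sm.OLS above: the constant plus four features
exog = pd.DataFrame(columns=['const', 'Age', 'Sex', 'BMI', 'ABP'])

def custom_standard_errors(endog, exog):
    # Returns one "standard error" per predictor column (all ones here)
    return [1 for i in range(len(exog.columns))]

print(custom_standard_errors(None, exog))  # [1, 1, 1, 1, 1]
```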

Stargazer output

    can you provide an example of some of your code, or a minimum reproducible example of some of the custom standard errors you want to incorporate into `statsmodels / stargazer`? you could also use a small toy data set if sharing the actual data you're using is problematic – Derek O Feb 14 '23 at 15:33
  • 1
    my feeling is that you could probably use inheritance to add new functionality to some of the class methods in the packages and libraries you're working with – as long as I have a bit more guidance or an example of what exactly you're looking for, i'd be happy to help! – Derek O Feb 14 '23 at 15:40
  • 1
    Example now provided! – lithium123 Feb 17 '23 at 01:42

1 Answer


This required some digging, but I believe I have a working solution.

When you create an instance of the Stargazer class such as your object stargazer, most of the regression results are extracted from the est object which is of type ResultsWrapper (from statsmodels).

Three instance methods (extract_data, _extract_feature, and extract_model_data) are involved, and extract_model_data does most of the heavy lifting: it extracts the features listed in statsmodels_map, which looks like the following:

statsmodels_map = {'p_values' : 'pvalues', ## ⭠ and this too
                   'cov_values' : 'params',
                   'cov_std_err' : 'bse', ## ⭠ we want to modify this 
                   'r2' : 'rsquared',
                   'r2_adj' : 'rsquared_adj',
                   'f_p_value' : 'f_pvalue',
                   'degree_freedom' : 'df_model',
                   'degree_freedom_resid' : 'df_resid',
                   'nobs' : 'nobs',
                   'f_statistic' : 'fvalue'
                   } 
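A rough sketch (not stargazer's actual source) of how extract_model_data consumes this map: for each entry, it reads the corresponding attribute off the fitted results object, which is effectively a guarded getattr.

```python
# Hedged sketch of the extraction step: each internal stargazer key maps to a
# statsmodels results attribute, read off the fitted results object.
def extract_model_data_sketch(model, statsmodels_map):
    data = {}
    for key, attr in statsmodels_map.items():
        # _extract_feature behaves roughly like a getattr with a fallback
        data[key] = getattr(model, attr, None)
    return data

# Dummy results object standing in for a statsmodels ResultsWrapper
class FakeResults:
    bse = [1.0, 2.0]
    rsquared = 0.5

print(extract_model_data_sketch(FakeResults(), {'cov_std_err': 'bse', 'r2': 'rsquared'}))
```

This is why overriding bse-based extraction in a subclass is enough to swap in custom standard errors.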

What we can do is create a child class called SuperStargazer that inherits all of the instance methods from Stargazer, and then override the extract_model_data to set cov_std_err using your custom_standard_errors function. Although it's not strictly necessary, we'll set custom_standard_errors as an instance attribute of the SuperStargazer class, as this allows you to use different custom functions if you define different instances of the SuperStargazer class.

Update: we'll also update the p-values as they are related to the new custom standard errors. This involves recalculating the t-values, and then applying the formula: p_value = 2*(1 - t.cdf(abs(t_value), dof))
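As a quick illustration of that formula (assuming scipy is available):

```python
from scipy.stats import t

def two_sided_p(t_value, dof):
    # Two-sided p-value for a t statistic: p = 2 * (1 - F(|t|)),
    # where F is the CDF of the t distribution with dof degrees of freedom
    return 2 * (1 - t.cdf(abs(t_value), dof))

print(two_sided_p(0.0, 100))         # 1.0: a zero t statistic is maximally insignificant
print(two_sided_p(3.0, 100) < 0.01)  # True: |t| = 3 clears the 1% level at dof = 100
```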

from statsmodels.base.wrapper import ResultsWrapper
from scipy.stats import t
from math import sqrt
import pandas as pd

class SuperStargazer(Stargazer):
    def __init__(self, models, custom_standard_errors, **kwargs):
        self.custom_standard_errors = custom_standard_errors
        super().__init__(models=models, **kwargs)
    
    def extract_model_data(self, model):
        # For features that are simple attributes of "model", establish the
        # mapping with internal name (TODO: adopt same names?):
        statsmodels_map = {# 'p_values' : 'pvalues',
                           'cov_values' : 'params',
                           # 'cov_std_err' : 'bse',
                           'r2' : 'rsquared',
                           'r2_adj' : 'rsquared_adj',
                           'f_p_value' : 'f_pvalue',
                           'degree_freedom' : 'df_model',
                           'degree_freedom_resid' : 'df_resid',
                           'nobs' : 'nobs',
                           'f_statistic' : 'fvalue'
                           }

        data = {}
        for key, val in statsmodels_map.items():
            data[key] = self._extract_feature(model, val)

        if isinstance(model, ResultsWrapper):
            data['cov_names'] = model.params.index.values
            endog, exog = model.model.data.orig_endog, model.model.data.orig_exog
            custom_std_err_data = self.custom_standard_errors(endog, exog)
            data['cov_std_err'] = pd.Series(
                index=exog.columns,
                data=custom_std_err_data
            )
            data['t_values'] = data['cov_values'] / data['cov_std_err']
            dof = model.df_resid  # residual degrees of freedom: n minus number of regressors
            data['p_values'] = pd.Series(
                index=data['t_values'].index,
                data=2 * (1 - t.cdf(abs(data['t_values']), dof))
            )
            
        else:
            # Simple RegressionResults, for instance as a result of
            # get_robustcov_results():
            data['cov_names'] = model.model.data.orig_exog.columns

            # These are simple arrays, not Series:
            for what in 'cov_values', 'cov_std_err':
                data[what] = pd.Series(data[what],
                                       index=data['cov_names'])

        data['conf_int_low_values'] = model.conf_int()[0]
        data['conf_int_high_values'] = model.conf_int()[1]
        data['resid_std_err'] = (sqrt(sum(model.resid**2) / model.df_resid)
                                 if hasattr(model, 'resid') else None)

        # Workaround for
        # https://github.com/statsmodels/statsmodels/issues/6778:
        if 'f_statistic' in data:
            data['f_statistic'] = (lambda x : x[0, 0] if getattr(x, 'ndim', 0)
                                   else x)(data['f_statistic'])

        return data

Then we can create an instance of SuperStargazer, and it will render the following table (in JupyterLab in my case, but you can also call stargazer.render_html() to store the HTML string for later use):

stargazer = SuperStargazer(
    models=[est], 
    custom_standard_errors=custom_standard_errors
)

Rendered Stargazer table (every standard error in parentheses is 1.000)

  • Thank you! I may be able to make this correction myself, but realize it may be faster if I note this here as you may be able to help. The improvement I have in mind is that I would like the statistical significance stars above the coefficients to also change depending on the realized p-values. In particular, all coefficients in the example above should have *** (as the p-values are past the 1% level) – lithium123 Feb 20 '23 at 04:45
  • 1
    I am happy to have helped! Out of curiosity, are you calculating the p-values in a different manner from `statsmodels`? When I look into `superstargazer.model_data`, the calculated p-values seem to match those indicated by the table. that being said, [this](https://github.com/mwburke/stargazer/blob/62135cf490911b13901099d483fc75db996f9002/stargazer/stargazer.py#L351-L361) is where the number of asterisks is being generated in the `stargazer` class – Derek O Feb 20 '23 at 05:03
  • 1
    No, I'm calculating the p-values in the same way as in statsmodels (given the coefficient estimate, custom standard error, and the degrees of freedom). – lithium123 Feb 20 '23 at 16:42
  • 1
    in that case, I think a "patch" should work best – we can set `t_values = cov_values / cov_std_err` in the same place where we set the custom standard errors, then use something like `p_values = 2 * (1 - t.cdf(abs(data['t_values']), dof))` – Derek O Feb 20 '23 at 18:29