Using Variable instead of column name in Statsmodel formula API

Question

I have a variable cols that contain list of column name for my table.

Now I want to run an regression on my table by looping through different columns of cols variable.

I am trying to use Statsmodel Formula API (Patsy) but am unable to construct a proper formula

The code that I am trying right now is:

model = smf.ols(formula="Annual_Sales ~ Q('cols')", data=df).fit()

But this obviously is not working as cols is not present in my df table.

Any suggestion how can I do this, preferably by for loop as I have 150 columns and I can't manually enter all those names in formula.

Thank You

score 1 · Answer 1 · answered Jun 28 '22 at 07:02

You don't even need a loop, as you can convert that list into a string and insert that string into the formula. Here is an example using a statsmodels sample dataset:

import statsmodels.api as sm
import statsmodels.formula.api as smf

df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

cols = ['Literacy', 'Wealth', 'Region']  # list of independent variables
cols_str = " + ".join(cols)
print(cols_str)  # 'Literacy + Wealth + Region'

mod = smf.ols(formula=f"Lottery ~ {cols_str}", data=df)
print(mod.fit().summary())

#                             OLS Regression Results                            
# ==============================================================================
# Dep. Variable:                Lottery   R-squared:                       0.338
# Model:                            OLS   Adj. R-squared:                  0.287
# Method:                 Least Squares   F-statistic:                     6.636
# Date:                Tue, 28 Jun 2022   Prob (F-statistic):           1.07e-05
# Time:                        06:59:17   Log-Likelihood:                -375.30
# No. Observations:                  85   AIC:                             764.6
# Df Residuals:                      78   BIC:                             781.7
# Df Model:                           6                                         
# Covariance Type:            nonrobust                                         
# ===============================================================================
#                   coef    std err          t      P>|t|      [0.025      0.975]
# -------------------------------------------------------------------------------
# Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
# Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
# Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
# Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
# Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
# Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
# Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
# ==============================================================================
# Omnibus:                        3.049   Durbin-Watson:                   1.785
# Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
# Skew:                          -0.340   Prob(JB):                        0.260
# Kurtosis:                       2.454   Cond. No.                         371.
# ==============================================================================

# Warnings:
# [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That's not what he wants to do. – Akimon Feb 11 '23 at 23:04 — Akimon, Feb 11 '23 at 23:04

score 1 · Accepted Answer · answered Jul 06 '22 at 12:41

One way I was able to solve this problem was using String Formatting, as generally the formula written inside Statsmodel is in String format.

So if we have,

col = ["a", "b", "c", "d"]

We can write,

for i in range(0, len(col) - 1):
     for j in range(i + 1, len(col)):
       model = smf.ols(formula="Annual_Sales ~ Q('{}') + ('{}')".format(col[i], col[j]), data=df).fit()

This will allow us to loop through the list variable col, while taking two factors at a time to create the model.

Using Variable instead of column name in Statsmodel formula API

2 Answers2