Parsing a Pandas dataframe with an unknown number of columns for use in statsmodels.api

Question

I would like to create a generic script to perform linear regressions on multiple data sets. Each data set will have the same y-variable called "SM" and an unknown number of x-variables. I have been able to do this successfully if I know exactly which data will be used for the regression. For example:

import pandas
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

data = pandas.read_excel('test.xlsx')

Then, print data gives:

print data
             SM  Glass    mag
 SiO2    73.500  77.27   0.00
 TiO2     0.233   0.15   7.37
 Al2O3   11.230  11.49   0.00
 FeO*     4.240   2.85  92.46
 MnO      0.082   0.06   0.00
 MgO      0.040   0.00   0.00
 CaO      0.410   0.22   0.00
 Na2O     5.630   4.58   0.00
 K2O      4.620   3.38   0.00

Then I prepare the dataframe and do the linear regression:

y, X = dmatrices('SM ~ Glass + mag', data=data, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
print res.summary()

This all works great. BUT, I'd like to be able to import an excel file with an unknown number of columns so I can do:

y, X = dmatrices('SM ~ X1 + X2 + X3 + ... Xn', data=data, return_type='dataframe')

I can parse the data frame and pull out individual columns, but I don't know how to them put them into the formula needed to do the linear regression. Any advice is appreciated!

What information are you looking to get back. Just coefficients? R-Square? T-Stats? — piRSquared, Aug 12 '16 at 23:49
For now, just the res.summary(). I mostly just want the coefficients. — keirasan, Aug 12 '16 at 23:54
If all the other columns, except `'SM'` are your exogenous variables, and you will only be doing a linear regression (that is, it will always be `x1 + x2 + x3...` and never `x1 * x2 + x3...`), there is an easier way: `endog = df['SM']; exog=df.drop('SM', axis=0); mod = sm.OLS(endog, exog); res = mod.fit(); res.summary()`... — Kartik, Aug 13 '16 at 02:55
@Kartik ooh, I see how this would work. That's very helpful, thank you! — keirasan, Aug 14 '16 at 02:36

score 1 · Accepted Answer · answered Aug 12 '16 at 23:53

1

See if this works:

df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))

formula = '{} ~ {}'.format(df.columns[0], ' + '.join(df.columns[1:]))

formula

'a ~ b + c + d + e + f + g + h + i + j'

answered Aug 12 '16 at 23:53

piRSquared

285,575
57
475
624

Parsing a Pandas dataframe with an unknown number of columns for use in statsmodels.api

1 Answers1