I am trying to run a linear model on my data using statsmodels. My dataframe looks like the following:
            0  Group  Age  Education
3_0001  190.8    1.0   47         12
3_0002  482.1    1.0   44         16
4_0003  144.1    0.0   38         18
4_0004  205.6    0.0   51         15
The first column is the index. The second column header is a 0 with several leading spaces. There are 88 rows of data. My code is as follows:
import statsmodels.formula.api as sm
formula = "'" + list(df)[0] + " ~ " + list(df)[1] + "'"
model = sm.ols(formula, data=df).fit()
I am getting an error message that says:
Traceback (most recent call last):
  File "AUC.py", line 109, in <module>
    model = sm.ols("'"+formula+"'", data=nodeDF_clean).fit()
  File "/usr/local/lib64/python3.6/site-packages/statsmodels/base/model.py", line 169, in from_formula
    missing=missing)
  File "/usr/local/lib64/python3.6/site-packages/statsmodels/formula/formulatools.py", line 65, in handle_formula_data
    NA_action=na_action)
  File "/usr/local/lib/python3.6/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/usr/local/lib/python3.6/site-packages/patsy/highlevel.py", line 169, in _do_highlevel_design
    return_type=return_type)
  File "/usr/local/lib/python3.6/site-packages/patsy/build.py", line 893, in build_design_matrices
    rows_checker.check(value.shape[0], name, origin)
  File "/usr/local/lib/python3.6/site-packages/patsy/build.py", line 795, in check
    raise PatsyError(msg, origin)
patsy.PatsyError: Number of rows mismatch between data argument and ' 0 ~ Group' (88 versus 1)
    ' 0 ~ Group'
    ^^^^^^^^^^^^^^^^^
I'm using patsy 0.5.1 and Python 3.6.8. I tried renaming the first column to get rid of the leading spaces, and I have tried many different variations of the ols formula, all with the same error. What am I doing wrong? Thanks in advance.
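For reference, here is a minimal sketch of how the formula string might be built without the extra quote characters. The "88 versus 1" message appears consistent with patsy reading the quoted text as a single Python string constant rather than as a "y ~ x" formula. The helper names `term` and `build_formula` are hypothetical, and the column name with four leading spaces is only a guess at the real header. Wrapping unusual column names in patsy's `Q()` is one option; renaming the column is another.

```python
import keyword


def term(column):
    """Return a formula term for a column name, wrapping unusual names in Q()."""
    if column.isidentifier() and not keyword.iskeyword(column):
        return column
    # repr() yields a valid Python string literal, leading spaces included.
    return f"Q({column!r})"


def build_formula(response, predictor):
    # No literal quote characters are added around the formula itself.
    return f"{term(response)} ~ {term(predictor)}"


def demo():
    # Imported here so the helpers above stay dependency-free.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Four rows copied from the question; the header "    0" is an assumption.
    df = pd.DataFrame(
        {
            "    0": [190.8, 482.1, 144.1, 205.6],
            "Group": [1.0, 1.0, 0.0, 0.0],
            "Age": [47, 44, 38, 51],
            "Education": [12, 16, 18, 15],
        },
        index=["3_0001", "3_0002", "4_0003", "4_0004"],
    )
    formula = build_formula(list(df)[0], list(df)[1])
    model = smf.ols(formula, data=df).fit()
    print(formula)
    print(model.params)


if __name__ == "__main__":
    demo()
```

With this sketch the string handed to `ols` contains no surrounding quote characters, so it is parsed as a response term and a predictor term instead of a one-element constant.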