
I have a pandas dataframe with 16 columns, 14 of which are variables on which I run an ANOVA test in a loop using statsmodels. My dataframe looks something like this (simplified):

ID    Cycle_duration    Average_support_phase    Average_swing_phase    Label
1               23.1                     34.3                   47.2        1
2               27.3                     38.4                   49.5        1
3               25.8                     31.1                   45.7        1
4               24.5                     35.6                   41.9        1
...

So far this is what I'm doing:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('features_total.csv')

# skip the non-feature columns so we don't regress ID or Label on itself
for variable in df.columns.drop(['ID', 'Label']):
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)

Which yields:

    sum_sq    df         F    PR(>F)
Label     0.124927   2.0  2.561424  0.084312
Residual  1.731424  71.0       NaN       NaN
              sum_sq    df         F    PR(>F)
Label      62.626057   2.0  4.969491  0.009552
Residual  447.374788  71.0       NaN       NaN
              sum_sq    df         F    PR(>F)
Label      62.626057   2.0  4.969491  0.009552
Residual  447.374788  71.0       NaN       NaN

I'm getting an individual table printed for each variable the ANOVA is performed on. What I want is to print one single table with the summarized results, something like this:

                             sum_sq     df         F    PR(>F)
          Cycle_duration   0.1249270   2.0  2.561424  0.084312
                Residual   1.7314240  71.0       NaN       NaN
   Average_support_phase   62.626057   2.0  4.969491  0.009552
                Residual  447.374788  71.0       NaN       NaN
     Average_swing_phase   62.626057   2.0  4.969491  0.009552
                Residual  447.374788  71.0       NaN       NaN

I can already see a problem: this method always prints the 'Label' nomenclature before the actual values, rather than the name of the variable in question (as shown above, I would like the variable name above each 'Residual' row). Is this even possible with the statsmodels approach?

I'm fairly new to Python, and excuse me if this has nothing to do with statsmodels; in that case, please do enlighten me on what I should be trying.

underclosed

1 Answer


You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:

keys = []
tables = []
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)

    keys.append(variable)
    tables.append(anova_table)

df_anova = pd.concat(tables, keys=keys, axis=0)

Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding one, but since you are performing numerous statistical tests, it makes sense to account for the probability that at least one of them results in a false positive.
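As a sketch of what that correction could look like: once you have one p-value per variable (the `PR(>F)` of the Label row from each ANOVA), `statsmodels.stats.multitest.multipletests` can adjust them. The p-values below are simply the ones from the question's sample output, used here for illustration:

```python
from statsmodels.stats.multitest import multipletests

# illustrative p-values, e.g. the PR(>F) of the Label row from each ANOVA
pvals = [0.084312, 0.009552, 0.009552]

# Benjamini-Hochberg FDR correction; returns rejection flags
# and adjusted p-values (plus two corrected alphas we ignore here)
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print(reject)
print(pvals_corrected)
```

Whether you want FDR control (`'fdr_bh'`) or family-wise control (`'bonferroni'`, `'holm'`) depends on your experimental question.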

busybear
  • Thank you so much, this did exactly what I wanted! As for your suggestion, do you mean post hoc testing? I am now trying to run Tukey HSD from statsmodels as well, in the same fashion as I did for the ANOVA, and trying to somehow create a loop to automate the analysis. In the end I would like to add the Tukey HSD results to the table you just helped me with. – underclosed Sep 04 '19 at 03:34
  • Not exactly. Tukey HSD would account for comparisons between each of your groups (Label) within each individual ANOVA. I'm referring to the multiple ANOVA tests you are performing. Much of this might depend on your setup and actual experimental questions, though. You can find more information over at the [Cross Validated](https://stats.stackexchange.com/) site; I'm sure there have been numerous related posts. – busybear Sep 04 '19 at 03:41
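Regarding the Tukey HSD follow-up in the comments, a minimal sketch of looping `pairwise_tukeyhsd` over the feature columns, with toy data standing in for the real features:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# toy data standing in for the real dataframe (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Cycle_duration': rng.normal(25, 2, 30),
    'Average_support_phase': rng.normal(35, 3, 30),
    'Label': np.repeat([1, 2, 3], 10),
})

for variable in df.columns.drop('Label'):
    # one Tukey HSD per variable: all pairwise group comparisons
    result = pairwise_tukeyhsd(df[variable], df['Label'])
    print(variable)
    print(result.summary())
```

With three groups, each call compares three pairs (1 vs 2, 1 vs 3, 2 vs 3); the summary tables could then be collected and concatenated in the same way as the ANOVA tables above.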