I would like to run a simple t-test in Python, comparing every possible pair of groups. Let's say I have the following data:

import pandas as pd

data = {'Category': ['cat3','cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2','cat3','cat3'],
        'values': [4,1,2,3,1,2,3,1,2,3,5,1,6,3]}
my_data = pd.DataFrame(data)

And I want to calculate the p-value based on a t-test for all possible category combinations, which are:

cat1 vs. cat2
cat2 vs. cat3
cat1 vs. cat3

I can do this manually via:

from scipy import stats

cat1 = my_data.loc[my_data['Category'] == 'cat1', 'values']
cat2 = my_data.loc[my_data['Category'] == 'cat2', 'values']
cat3 = my_data.loc[my_data['Category'] == 'cat3', 'values']

print(stats.ttest_ind(cat1,cat2).pvalue)
print(stats.ttest_ind(cat2,cat3).pvalue)
print(stats.ttest_ind(cat1,cat3).pvalue)

But is there a simpler and more straightforward way to do this? The number of categories may differ from case to case, so the number of t-tests that need to be calculated will also differ.

The final output should be a DataFrame with one row per comparison and the columns category1 | category2 | p-value. In this case it should look like:

cat1 | cat2 | 0.16970867501294376
cat2 | cat3 | 0.0170622126550303
cat1 | cat3 | 0.13951958313684434
user2635656
    Is there a reason you're doing a series of t-tests, rather than a single [anova](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)? – G. Anderson Feb 17 '20 at 16:10
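
A minimal sketch of the one-way ANOVA suggested in this comment, assuming the question's data. Note that it tests whether any group mean differs from the others, not which specific pairs differ, so it answers a somewhat different question than the pairwise t-tests:

```python
import pandas as pd
from scipy import stats

data = {'Category': ['cat3','cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2','cat3','cat3'],
        'values': [4,1,2,3,1,2,3,1,2,3,5,1,6,3]}
my_data = pd.DataFrame(data)

# Build one array of values per category, however many categories there are
samples = [grp['values'].to_numpy() for _, grp in my_data.groupby('Category')]

# f_oneway takes each group as a separate positional argument
f_stat, p_value = stats.f_oneway(*samples)
print(f_stat, p_value)
```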

2 Answers

Consider iterating through itertools.combinations across categories:

from itertools import combinations
...

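def ttest_run(c1, c2):
    # Select the values belonging to each category before testing
    x = my_data.loc[my_data['Category'] == c1, 'values']
    y = my_data.loc[my_data['Category'] == c2, 'values']
    results = stats.ttest_ind(x, y)
    df = pd.DataFrame({'categ1': c1,
                       'categ2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                       index = [0])
    return df

# One single-row DataFrame per unordered pair of categories
df_list = [ttest_run(i, j) for i, j in combinations(my_data['Category'].unique().tolist(), 2)]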

final_df = pd.concat(df_list, ignore_index = True)
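
For reference, a self-contained sketch of the same idea, assuming the question's `my_data`. It splits the values once with `groupby` and collects plain dicts instead of single-row DataFrames. The helper name `pairwise_ttests` and its parameters are hypothetical, chosen only for illustration:

```python
from itertools import combinations

import pandas as pd
from scipy import stats

data = {'Category': ['cat3','cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2','cat3','cat3'],
        'values': [4,1,2,3,1,2,3,1,2,3,5,1,6,3]}
my_data = pd.DataFrame(data)

def pairwise_ttests(df, group_col, value_col):
    """Run an independent two-sample t-test for every pair of groups."""
    # Map each group label to its values (groupby sorts the labels)
    groups = {name: grp[value_col] for name, grp in df.groupby(group_col)}
    rows = []
    for g1, g2 in combinations(groups, 2):
        res = stats.ttest_ind(groups[g1], groups[g2])
        rows.append({'category1': g1, 'category2': g2, 'p-value': res.pvalue})
    return pd.DataFrame(rows)

final_df = pairwise_ttests(my_data, 'Category', 'values')
print(final_df)
```

The number of rows grows as n(n-1)/2 with the number of categories, so with many groups a multiple-comparison correction is worth considering.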
Parfait

You can use MultiComparison from statsmodels (other libraries offer similar functionality).

from scipy import stats
import statsmodels.stats.multicomp as mc

comp1 = mc.MultiComparison(dataframe[ValueColumn], dataframe[CategoricalColumn])
tbl, a1, a2 = comp1.allpairtest(stats.ttest_ind, method= "bonf")

The p-values are then available in:

a1[0][:, 1]  # raw p-values (a1[0] holds one row of statistic and p-value per pair)
a1[2]        # p-values corrected by Bonferroni in this case
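
To illustrate what the Bonferroni correction does, here is a minimal sketch that applies it by hand to the raw p-values quoted in the question: each p-value is multiplied by the number of comparisons and capped at 1. The variable names are hypothetical:

```python
# Raw p-values quoted in the question, keyed by the pair compared
raw_pvalues = {
    ('cat1', 'cat2'): 0.16970867501294376,
    ('cat2', 'cat3'): 0.0170622126550303,
    ('cat1', 'cat3'): 0.13951958313684434,
}

n_tests = len(raw_pvalues)

# Bonferroni: scale each p-value by the number of tests, never exceeding 1
corrected = {pair: min(p * n_tests, 1.0) for pair, p in raw_pvalues.items()}

for pair, p in corrected.items():
    print(pair, p)
```

With three comparisons, the cat2 vs. cat3 result no longer falls below 0.05 after correction, which is why the corrected and uncorrected columns can lead to different conclusions.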
juan trinidad
  • Does MultiComparison use independent samples, though? This is not really a paired t-test, as also indicated by your use of "stats.ttest_ind" instead of "stats.ttest_rel" in the second-to-last line of the code. – CornelioQuinto Feb 20 '23 at 16:03