I have dataframe which is similar to this one.

import pandas as pd
import numpy as np
import string
import random
def generate_example_dataframe()-> pd.DataFrame:
    """
    Generate a simple dataframe in long format.
    """
    num = 20  # number of regions used in simulations
    subjects_num = 10
    random.seed(1)
    conditions = ["open", "closed"]
    groups = ["old", "young"]
    means = [1,1.5,1.25,1.75]
    regions = [f"region_{s}" for s in string.ascii_letters[:num]]
    subjects = [f"subject_{s}" for s in range(1, subjects_num + 1)]

    list_of_dataframes = []
    for subject in subjects:
        for region in regions:
            lst = iter(means)
            for condition in conditions:
                for group in groups:
                    mean = next(lst)
                    values = mean + np.random.rand(num) + 0.2*random.random()
                    temp_df = pd.DataFrame({
                        'region': [region] * num,
                        'group': [group] * num,
                        'condition': [condition] * num,
                        'subject': [subject] * num,
                        'values': values,
                    })
                    list_of_dataframes.append(temp_df)

    return pd.concat(list_of_dataframes)



# %% [markdown]
# Generating the sample dataframe; it is presented in long format - one observation per row

# %%
df = generate_example_dataframe()
df.head(10).to_clipboard(sep=',', index=True)

Which gives output like this:

,region,group,condition,subject,values
0,region_a,old,open,subject_1,1.4914914311214753
1,region_a,old,open,subject_1,1.9742822483723783
2,region_a,old,open,subject_1,1.0461147549953116
3,region_a,old,open,subject_1,1.9369465073938947
4,region_a,old,open,subject_1,1.817792271839675
5,region_a,old,open,subject_1,1.4272522367426221
6,region_a,old,open,subject_1,1.129423554333859
7,region_a,old,open,subject_1,1.9021298911486018
8,region_a,old,open,subject_1,1.950500304961099
9,region_a,old,open,subject_1,1.6832358513116206

I want to do a simple t-test on the values with separation by region, group and condition (number of tests = regions x groups x conditions). What is the most pythonic way to do this? The only way I can think of now is to iterate over the values of these variables in a loop and subset the big dataframe.

3 Answers

from scipy.stats import ttest_ind as ttest
from itertools import combinations

df = generate_example_dataframe()
grouped_dataframes = [frame for _, frame in df.groupby(['region', 'group', 'condition'])['values']]

# for the p value
results = [ttest(*comb).pvalue for comb in combinations(grouped_dataframes, 2)]
# for the statistic
results = [ttest(*comb).statistic for comb in combinations(grouped_dataframes, 2)]

df.groupby(['region', 'group', 'condition']) will get the region x group x condition subsets for you.

I'm not sure if there's an optimized approach to performing the t-test for each combination of subsets. Please let me know if I've misunderstood what was wanted.
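If you do keep the pairwise route, the p-values should at least be adjusted for the number of comparisons. A minimal Bonferroni sketch (the `pvalues` array here is hypothetical, standing in for the list comprehension's results):

```python
import numpy as np

# hypothetical p-values from many pairwise t-tests
pvalues = np.array([0.001, 0.04, 0.20, 0.0005])

# Bonferroni: multiply each p-value by the number of tests, capped at 1.0
adjusted = np.minimum(pvalues * len(pvalues), 1.0)

# reject the null only where the adjusted p-value stays below alpha
significant = adjusted < 0.05
```

Bonferroni is conservative; `statsmodels.stats.multitest.multipletests` offers less strict corrections (Holm, FDR) if you have statsmodels available.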

If this is with ttest_1samp, you could do:

from scipy.stats import ttest_1samp as ttest

array = np.stack([frame for _, frame in df.groupby(['region', 'group', 'condition'])['values']])
result = ttest(array, df['values'].mean(), axis=1)
Steele Farnsworth
  • You shouldn't really do t-tests like this, because you'll inflate the number of false positives by doing [so many comparisons](https://en.wikipedia.org/wiki/Multiple_comparisons_problem). You should use an [ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) test. If that identifies that the means are different, you can follow it by a [Tukey HSD](https://www.statsmodels.org/dev/generated/statsmodels.stats.multicomp.pairwise_tukeyhsd.html) test to identify the groups which are different. – Nick ODell Jul 26 '21 at 01:19
  • Thank you for your response. Please let me know if there are any issues with my alternative solution that uses `ttest_1samp`. – Steele Farnsworth Jul 26 '21 at 01:25
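The ANOVA route suggested in the comment might look like the sketch below for a single region. The four samples here are synthetic stand-ins for that region's group x condition subsets; `f_oneway` is from `scipy.stats`:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# synthetic samples standing in for one region's group x condition subsets
old_open = rng.normal(1.0, 0.3, 20)
old_closed = rng.normal(1.5, 0.3, 20)
young_open = rng.normal(1.25, 0.3, 20)
young_closed = rng.normal(1.75, 0.3, 20)

# one-way ANOVA: is at least one of the four means different?
stat, pvalue = f_oneway(old_open, old_closed, young_open, young_closed)

# if pvalue is small, follow up with a Tukey HSD test
# (e.g. statsmodels.stats.multicomp.pairwise_tukeyhsd) to see which pairs differ
```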

I think this does what you want if I've understood correctly?

from scipy.stats import ttest_1samp
from itertools import product

pop_mean = df["values"].mean()
all_t_tests = [
    x for x in product(df.region.unique(), df.group.unique(), df.condition.unique())
]
all_t_test_groups = {
    "+".join(x): df.loc[
        (df.region == x[0]) & (df.group == x[1]) & (df.condition == x[2])
    ]["values"]
    for x in all_t_tests
}
all_t_test_values = {k: ttest_1samp(v, pop_mean) for k, v in all_t_test_groups.items()}
print(all_t_test_values)

Out:

{'region_a+old+open': Ttest_1sampResult(statistic=-15.269217874013226, pvalue=3.018447106624542e-34),...

EDIT: Building on the other answer you can also do this with a groupby! Probably a bit neater even though I like itertools.

df = generate_example_dataframe()
pop_mean = df["values"].mean()
grouped_dfs = [x for x in df.groupby(['region', 'group', 'condition'])['values']]
all_t_test_groups = {
    "+".join(x[0]): x[1]
    for x in grouped_dfs
}
all_t_test_values = {k: ttest_1samp(v, pop_mean) for k, v in all_t_test_groups.items()}
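The dict-building step can also be collapsed into a single `groupby(...).apply`, which keeps the results in a Series indexed by the group keys. A sketch against a small stand-in dataframe with the same column names:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# small stand-in for the generated dataframe
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'region': ['region_a'] * 10 + ['region_b'] * 10,
    'group': ['old'] * 20,
    'condition': ['open'] * 20,
    'values': rng.normal(1.2, 0.3, 20),
})

pop_mean = df['values'].mean()

# one p-value per (region, group, condition) cell, indexed by the keys
pvalues = df.groupby(['region', 'group', 'condition'])['values'].apply(
    lambda v: ttest_1samp(v, pop_mean).pvalue
)
```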
osint_alex
  • I based my alternative solution that uses `ttest_1samp` off of this, but `DataFrame.groupby` should be used to create the subsets, not manual iteration. – Steele Farnsworth Jul 26 '21 at 01:27

I also learned other methods from the others' answers. I made a solution like this:

# grouping

from scipy import stats

df['grouping'] = df['region'] + "_" + df['group'] + "_" + df['condition']

for i in df.grouping.unique():
    print(i)
    t = 'result_' + i
    locals()[t] = stats.ttest_1samp(df.loc[df['grouping'] == i, 'values'], 0)
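A dictionary avoids writing into `locals()`, which is not guaranteed to create variables if the loop ever moves inside a function. The same loop could be sketched as (the small dataframe here is a stand-in with the combined grouping column already built):

```python
import pandas as pd
from scipy import stats

# small stand-in dataframe with the combined grouping column
df = pd.DataFrame({
    'grouping': ['region_a_old_open'] * 5 + ['region_a_old_closed'] * 5,
    'values': [1.1, 1.3, 0.9, 1.2, 1.0, 1.6, 1.4, 1.7, 1.5, 1.8],
})

# one test result per grouping, keyed by the grouping label
results = {
    g: stats.ttest_1samp(df.loc[df['grouping'] == g, 'values'], 0)
    for g in df['grouping'].unique()
}
```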
sanzo213