How to loop through a pandas dataframe to run an independent ttest for each of the variables?

Question

I have a dataset that consists of around 33 variables. The dataset contains patient information and the outcome of interest is binary in nature. Below is a snippet of the data.

The dataset is stored as a pandas dataframe

df.head()

ID     Age  GAD  PHQ  Outcome
1      23   17   23      1
2      54   19   21      1
3      61   23   19      0
4      63   16   13      1
5      37   14   8       0

I want to run independent t-tests looking at the differences in patient information based on outcome. So, if I were to run a t-test for each alone, I would do:

age_neg_outcome = df.loc[df.Outcome == 0, ['Age']]
age_pos_outcome = df.loc[df.Outcome == 1, ['Age']]

t_age, p_age = stats.ttest_ind(age_neg_outcome ,age_pos_outcome, unequal = True)

print('\t Age: t= ', t_age, 'with p-value= ', p_age)

How can I do this in a for loop for each of the variables?

I've seen this post which is slightly similar but couldn't manage to use it.

Python : T test ind looping over columns of df

score 2 · Accepted Answer · answered Jun 06 '21 at 23:57


You are almost there. ttest_ind accepts multi-dimensional arrays too:

from scipy import stats

cols = ['Age', 'GAD', 'PHQ']
cond = df['Outcome'] == 0  # column name as shown in df.head()

neg_outcome = df.loc[cond, cols]
pos_outcome = df.loc[~cond, cols]

# ttest_ind has no `unequal` parameter, so I'm leaving it out
t, p = stats.ttest_ind(neg_outcome, pos_outcome)
for i, col in enumerate(cols):
    print(f'\t{col}: t = {t[i]:.5f}, with p-value = {p[i]:.5f}')

Output:

    Age: t = 0.12950, with p-value = 0.90515
    GAD: t = 0.32937, with p-value = 0.76353
    PHQ: t = -0.96683, with p-value = 0.40495
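
If the intent behind `unequal = True` in the question was a test that does not assume equal variances, SciPy exposes that through the `equal_var` parameter of `ttest_ind`. The following is a minimal sketch of that variant (Welch's t-test), assuming the five sample rows from the question and a column named `Outcome` as displayed in `df.head()`; the names `t_welch` and `p_welch` are illustrative only.

```python
import pandas as pd
from scipy import stats

# Sample rows copied from the question (assumption: the column is named 'Outcome')
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Age': [23, 54, 61, 63, 37],
    'GAD': [17, 19, 23, 16, 14],
    'PHQ': [23, 21, 19, 13, 8],
    'Outcome': [1, 1, 0, 1, 0],
})

cols = ['Age', 'GAD', 'PHQ']
cond = df['Outcome'] == 0

# Convert to plain NumPy arrays; each column is tested independently (axis=0)
neg_outcome = df.loc[cond, cols].to_numpy()
pos_outcome = df.loc[~cond, cols].to_numpy()

# equal_var=False requests Welch's t-test instead of Student's t-test
t_welch, p_welch = stats.ttest_ind(neg_outcome, pos_outcome, equal_var=False)

for col, t_val, p_val in zip(cols, t_welch, p_welch):
    print(f'\t{col}: t = {t_val:.5f}, with p-value = {p_val:.5f}')
```

With only two or three rows per group, as in this sample, the p-values are not meaningful; the sketch only shows where the parameter goes.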


Code Different


Hey, thanks for your reply on this. This works perfectly. I have run into another issue, though: when a variable contains NaNs, the output comes out as t = nan, p = nan. I found a solution in this thread, https://stackoverflow.com/questions/37022888/t-test-in-scipy-with-nan-values, where the issue was resolved by dropping the NaNs before passing the data to the t-test. Any suggestions on how to implement this in this loop? – ummendial Jun 09 '21 at 15:19
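
One possible way to handle the NaN case raised in the comment is to test each column separately and drop its missing values first, since different columns may be missing different rows. This is only a sketch under assumptions: the data below is made up for illustration (it is not the asker's dataset), the column is assumed to be named `Outcome`, and `results` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data with a few missing values, for illustration only
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [23, 54, 61, 63, 37, np.nan, 45, 50],
    'GAD': [17, 19, 23, 16, 14, 20, np.nan, 18],
    'PHQ': [23, 21, 19, 13, 8, 15, 17, np.nan],
    'Outcome': [1, 1, 0, 1, 0, 0, 1, 0],
})

cols = ['Age', 'GAD', 'PHQ']
cond = df['Outcome'] == 0

results = {}
for col in cols:
    # Drop NaNs per column and per group, so a missing value in one
    # variable does not discard that row for the other variables
    neg = df.loc[cond, col].dropna()
    pos = df.loc[~cond, col].dropna()

    t_val, p_val = stats.ttest_ind(neg, pos)
    results[col] = (float(t_val), float(p_val))
    print(f'\t{col}: t = {t_val:.5f}, with p-value = {p_val:.5f}')
```

`ttest_ind` also accepts `nan_policy='omit'`, which ignores NaNs inside the call itself and may be enough if an explicit loop is not wanted.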
