
I have a dataframe consisting of four columns and around 20000 rows, like this.

import pandas as pd
import numpy as np
d = {'x': [1, 1, 0, 1, 0, 0, 1],
     'BPM': [70, 55, 45, np.nan, 35, 25, np.nan],
     'AGE': [50, 47, 21, 50, 24, 47, 16],
     'WEIGHT': [50, 100, 50, np.nan, np.nan, 100, 27]}
df = pd.DataFrame(data=d)

x  BPM  AGE  WEIGHT
1  70   50   50
1  55   47   100
0  45   21   50
1  nan  50   nan
0  35   24   nan
0  25   47   100
1  nan  16   27

Is there any significant difference in "BPM" between class '1' and class '0' after matching AGE and WEIGHT?

There are two classes: 0 and 1. The two classes do not have the same number of samples. I understand that I first have to match the values and can then apply a t-test. I am new to this field, so I do not understand how to proceed.

1 Answer

You could calculate the t-scores by hand, one for each (AGE, WEIGHT) combination.

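# Mean BPM per (AGE, WEIGHT) stratum, one column per class of x.
# groupby drops rows whose AGE or WEIGHT is NaN by default.
mean_bpm_df = df.groupby(['AGE', 'WEIGHT', 'x']).mean().unstack(level=-1)
mean_bpm_df.columns = ['mean_bpm_0', 'mean_bpm_1']
# Standard deviation and non-null count of BPM per stratum, both classes combined.
std_count_df = df.drop(columns='x').groupby(['AGE', 'WEIGHT']).agg(['std', 'count'])
std_count_df.columns = ['std_bpm', 'count_bpm']
# t-score per stratum: difference of class means divided by the standard error.
t_df = (mean_bpm_df.mean_bpm_0 - mean_bpm_df.mean_bpm_1) / (std_count_df.std_bpm / np.sqrt(std_count_df.count_bpm))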
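
As a quick check, the same calculation can be run on the seven sample rows from the question. This is only a sketch: the helper name `stratum_t_scores` is made up for illustration, and it selects the BPM column explicitly instead of renaming columns. On this sample only the (47, 100) stratum contains both classes, so it is the only one with a defined t-score (-2.0); every other stratum comes out as NaN.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
d = {'x': [1, 1, 0, 1, 0, 0, 1],
     'BPM': [70, 55, 45, np.nan, 35, 25, np.nan],
     'AGE': [50, 47, 21, 50, 24, 47, 16],
     'WEIGHT': [50, 100, 50, np.nan, np.nan, 100, 27]}
df = pd.DataFrame(data=d)


def stratum_t_scores(frame):
    """Per-(AGE, WEIGHT) t-scores using the formula from this answer.

    Hypothetical helper, not part of pandas or scipy.
    """
    # Mean BPM per stratum and class, with the classes 0 and 1 as columns.
    means = frame.groupby(['AGE', 'WEIGHT', 'x'])['BPM'].mean().unstack('x')
    # Standard deviation and non-null count per stratum, classes combined.
    stats = frame.groupby(['AGE', 'WEIGHT'])['BPM'].agg(['std', 'count'])
    return (means[0] - means[1]) / (stats['std'] / np.sqrt(stats['count']))


t_scores = stratum_t_scores(df)
print(t_scores)
```

With the full 20000-row dataset, more strata should contain both classes, but strata with a single row or a single class will still yield NaN.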

If you also want the p-values, those can be calculated by hand too. This assumes a two-sided t-test; adjust it if you need a one-sided test.

from scipy.stats import t
# Two-sided p-value per stratum, using count - 1 degrees of freedom.
p_df = pd.DataFrame(index=t_df.index, data=2 * (1 - t.cdf(abs(t_df), std_count_df.count_bpm - 1)))
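
If a single overall answer is wanted rather than one value per stratum, one possible alternative (a different technique from the by-hand calculation here) is to pair the two class means within each exactly matched (AGE, WEIGHT) stratum and run a paired t-test with `scipy.stats.ttest_rel`. The sketch below uses made-up toy data, because the question's sample has only one stratum containing both classes. It assumes that exact matching on AGE and WEIGHT is acceptable, which may leave few matched strata in real data.

```python
import pandas as pd
from scipy.stats import ttest_rel

# Made-up toy data for illustration only: four strata, each with both classes.
toy = pd.DataFrame({
    'x':      [0, 1, 0, 1, 0, 1, 0, 1],
    'BPM':    [60, 66, 72, 75, 58, 65, 80, 82],
    'AGE':    [30, 30, 40, 40, 50, 50, 60, 60],
    'WEIGHT': [70, 70, 80, 80, 65, 65, 90, 90],
})

# Mean BPM per (AGE, WEIGHT) stratum, with the classes 0 and 1 as columns.
means = toy.groupby(['AGE', 'WEIGHT', 'x'])['BPM'].mean().unstack('x')

# Keep only strata where both classes are present (exact matching).
matched = means.dropna()

# Paired two-sided t-test on class 1 minus class 0 across matched strata.
result = ttest_rel(matched[1], matched[0])
print(result.statistic, result.pvalue)
```

Each matched stratum contributes one pair here, so the test has (number of matched strata - 1) degrees of freedom.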
Philip Egger