
I did a t-test analysis using the scipy library and wanted to cross-check it against my own t-test function. To my surprise, when my series did not contain any NaN values, my function and scipy gave the same t-value and p-value. If the series contain NaN values, there are some differences, even though I have removed the NaN values. Does anyone know what might be causing this?

from math import sqrt
from numpy import mean
from scipy.stats import t, sem  # sem is used in the function below
import numpy as np
import pandas as pd
from scipy import stats

# function for calculating the t-test for two independent samples
def independent_ttest(data1, data2, alpha):
    # calculate means
    mean1, mean2 = mean(data1), mean(data2)
    # calculate standard errors
    se1, se2 = sem(data1), sem(data2)
    # standard error on the difference between the samples
    sed = sqrt(se1**2.0 + se2**2.0)
    # calculate the t statistic
    t_stat = (mean1 - mean2) / sed

    # degrees of freedom
    df = len(data1) + len(data2) - 2
    # calculate the critical value
    cv = t.ppf(1.0 - alpha, df)
    # calculate the p-value
    p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
    # return everything

    return t_stat, df, cv, p

# calculate the t test
alpha = 0.05
x = np.arange(10.)
b = x*1.1
df_x = pd.Series(x)
df_b = pd.Series(b)
df_x_nan = df_x.replace(7.0, np.nan)
df_x_nan = df_x.replace(4.0, np.nan)  # note: this overwrites the line above, so only 4.0 becomes NaN

print('Without NaN')
t_stat, df, cv, p = independent_ttest(df_x, df_b, alpha)
t_stat_scipy, p_scipy = stats.ttest_ind(df_x, df_b, nan_policy='omit')
print("t-test function, t_Stat: {}".format(t_stat))
print("t-test scipy, t_Stat: {}".format(t_stat_scipy))
print("t-test function, p: {}".format(p))
print("t-test scipy, p: {}".format(p_scipy))
print('===================')
print('With NaN')
t_stat, df, cv, p = independent_ttest(df_x_nan.dropna(), df_b, alpha)
t_stat_scipy, p_scipy = stats.ttest_ind(df_x_nan, df_b, nan_policy='omit')
print("t-test function, t_Stat: {}".format(t_stat))
print("t-test scipy, t_Stat: {}".format(t_stat_scipy))
print("t-test function, p: {}".format(p))
print("t-test scipy, p: {}".format(p_scipy))

Here are the outputs:

Without NaN
t-test function, t_Stat: -0.3161627186509306
t-test scipy, t_Stat: -0.31616271865093054
t-test function, p: 0.7555158566691087
t-test scipy, p: 0.7555158566691088
===================
With NaN
t-test function, t_Stat: -0.2628962556410858
t-test scipy, t_Stat: -0.2623389223791333
t-test function, p: 0.7957901706958825
t-test scipy, p: 0.7962126903526476
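
For reference, the agreement in the no-NaN case is not a coincidence: with equal sample sizes, the pooled standard error that scipy uses reduces algebraically to the `sqrt(se1**2 + se2**2)` used in the function above. A minimal numerical check (a sketch, using the same `x` and `b` as above):

```python
import numpy as np
from scipy.stats import sem

x = np.arange(10.0)
b = x * 1.1
n = len(x)  # both samples have the same size here

# unpooled (Welch-style) standard error, as in the question's function
se_unpooled = np.sqrt(sem(x) ** 2 + sem(b) ** 2)

# pooled standard error, as in scipy's default ttest_ind (equal_var=True)
sp2 = (x.var(ddof=1) + b.var(ddof=1)) / 2.0
se_pooled = np.sqrt(sp2 * (2.0 / n))

print(se_unpooled, se_pooled)  # identical when n1 == n2
```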
  • i think this is due to different rounding – warped Apr 28 '19 at 11:02
  • the length should be the same after the NaNs are dropped, so I don't think the rounding should be so different – user3776800 Apr 28 '19 at 11:51
  • Why would the lengths be the same? You replaced things in dfx twice to make dfx_nan? Since the sample sizes between x and y become unequal, presumably numpy is using the formulas for unequal samples. Maybe you means the lengths of the dfx passed to your function and numpy are the same. But the numbers of x's and b's now differ. – Jeremy Kahan Apr 28 '19 at 12:24
  • I meant that the length is the same when calling my function and scipy function after I have dropped the nans – user3776800 Apr 28 '19 at 12:27
  • So then I think the issue is that the sed calculation you use is for equal sample sizes, but they are no longer equal and numpy accounts for that. They are almost equal, so you might be justified, but numpy appears to be worrying about it. – Jeremy Kahan Apr 28 '19 at 12:30
  • I put `equal_var = False` in `stats.ttest_ind` as a parameter. Then scipy gives me `t-test scipy, p: 0.13946695298758838`. I think this explains why I got different answers before – user3776800 Apr 28 '19 at 12:42
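
To summarize the thread: the function above computes the unpooled (Welch-style) standard error, while `stats.ttest_ind` with the default `equal_var=True` pools the two sample variances. The two coincide for equal sample sizes but diverge once one NaN is dropped (n1 = 9 vs n2 = 10). A minimal sketch reproducing both t statistics from the output above:

```python
import numpy as np
from scipy import stats

x = np.arange(10.0)
b = x * 1.1
x_drop = np.delete(x, 4)  # mimic replacing 4.0 with NaN and dropping it

n1, n2 = len(x_drop), len(b)
m1, m2 = x_drop.mean(), b.mean()
v1, v2 = x_drop.var(ddof=1), b.var(ddof=1)

# unpooled (Welch-style) SE -- what sqrt(se1**2 + se2**2) computes
t_welch = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# pooled SE -- what scipy's default equal_var=True uses
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

t_scipy, _ = stats.ttest_ind(x_drop, b)  # pooled by default
print(t_welch, t_pooled, t_scipy)
```

`t_welch` matches the custom function's `-0.26289…` and `t_pooled` matches scipy's `-0.26233…`, which accounts for the gap without any rounding issue.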

0 Answers