I have two distributions below (Kaggle dataset: Rossmann sales) that look similar visually: sales on normal days and sales on school holidays.
However, they seem to fail a z-test (hypothesis testing) in Python. Why is that so?
How should I perform the statistical test (z-test) in Python? Should I use usevar='pooled' or usevar='unequal' (that is, should I assume the same variance or different variances)? I also found that switching School_hol_sales and Normal_day_sales in the code below yields different results, and I am not sure why.
School_hol_sales = df[(df.Open==1)&(df.SchoolHoliday==1)&(df.StateHoliday=='0')&(df.Promo==0)].Sales
Normal_day_sales = df[(df.Open==1)&(df.SchoolHoliday==0)&(df.StateHoliday=='0')&(df.Promo==0)].Sales
School_hol_sales.mean(), Normal_day_sales.mean() # (6230.4, 5904.6)
School_hol_sales.std(), Normal_day_sales.std() # (2841.8, 2602.9)
# which is the correct one?
# Approach 1: CompareMeans.ztest_ind with unequal variances
import statsmodels.stats.api as sms
cm = sms.CompareMeans(sms.DescrStatsW(School_hol_sales), sms.DescrStatsW(Normal_day_sales))
z, pval = cm.ztest_ind(alternative='larger', usevar='unequal')
print('z: {} , pval: {}'.format(z, pval))
# Approach 2: ztest with pooled variance
from statsmodels.stats.weightstats import ztest
z, pval = ztest(School_hol_sales, Normal_day_sales, alternative='larger', usevar='pooled', ddof=1.0)
print('z: {} , pval: {}'.format(z, pval))
Output:
z: 28.53350149055591 , pval: 2.2504631945823565e-179
z: 30.17089944207645 , pval: 2.853425122518376e-200
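To make the two points of confusion concrete (pooled vs. unequal variance, and the effect of swapping the samples), here is a minimal sketch of the arithmetic using only the Python standard library. It is not the statsmodels implementation: the helper two_sample_z is hypothetical, the data are synthetic draws that only borrow the means and standard deviations quoted above, and the sample sizes (2000 and 8000) are assumptions, since the real ones are not shown.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_z(x, y, usevar="pooled"):
    """Hypothetical helper: z statistic and one-sided p-value for H1: mean(x) > mean(y)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (ddof=1)
    if usevar == "pooled":
        # One common variance estimated from both samples
        vp = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(vp * (1 / n1 + 1 / n2))
    elif usevar == "unequal":
        # Each sample keeps its own variance
        se = math.sqrt(v1 / n1 + v2 / n2)
    else:
        raise ValueError("usevar must be 'pooled' or 'unequal'")
    z = (mean(x) - mean(y)) / se
    p_upper = 1 - NormalDist().cdf(z)  # upper-tail p-value ('larger' alternative)
    return z, p_upper


# Synthetic stand-ins, NOT the Rossmann data; sample sizes are assumed.
random.seed(0)
school = [random.gauss(6230.4, 2841.8) for _ in range(2000)]
normal = [random.gauss(5904.6, 2602.9) for _ in range(8000)]

z_pooled, p_pooled = two_sample_z(school, normal, usevar="pooled")
z_unequal, p_unequal = two_sample_z(school, normal, usevar="unequal")
z_swapped, p_swapped = two_sample_z(normal, school, usevar="pooled")

print("pooled :", z_pooled, p_pooled)
print("unequal:", z_unequal, p_unequal)
print("swapped:", z_swapped, p_swapped)
```

In this sketch the pooled and unequal-variance versions differ only in the standard error, so they give different z values whenever the group sizes and variances differ. Swapping the two samples flips the sign of z, and because the alternative is one-sided, the p-value becomes one minus the original.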