
Here I am using the built-in `ztest` function from statsmodels to run a single hypothesis test. However, if I want to run many separate hypothesis tests - on many different columns - to test, say, the difference between two medians or two means, it becomes cumbersome to do it one by one. Is there a faster, more efficient way (memory- and time-wise) to run n of these tests? To be more specific: say we have a dataframe of n columns, and I want to test the difference between the mean or median return of certain trading days (or a sequence of them) for a certain ticker versus the overall mean of that ticker over some period of time, say 5 years of daily values. In the standard case, one would use

from statsmodels.stats.weightstats import ztest

ztest_Score, p_value = ztest(df_alternative['symbol is here'], df_null, alternative='two-sided')

where of course df_null above is a scalar quantity (say, the daily average return for the entire period), and df_alternative is a column within a larger dataframe of tickers, holding the mean or median of your sequence of trading days. How can one do this iterative procedure in just one line of code, if possible, so that it goes over each of these separate columns in my dataframe and the corresponding associated mean or median value, compares them, and decides which hypotheses to reject or not?
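A minimal sketch of one way this could look, assuming a hypothetical dataframe `df` of ticker returns and a per-ticker null value `null_means` (both names are illustrative, not from statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
# Hypothetical data: ~1 year of daily returns for 3 tickers
df = pd.DataFrame(rng.normal(size=(250, 3)), columns=["AAA", "BBB", "CCC"])
# One null value per ticker, e.g. the long-run average daily return
null_means = df.mean()

# One pass over all columns; each entry is (z statistic, p-value)
results = {col: ztest(df[col], value=null_means[col], alternative="two-sided")
           for col in df.columns}
summary = pd.DataFrame(results, index=["z_stat", "p_value"]).T
print(summary)
```

Here each column is tested against its own scalar null, and the results are collected into one dataframe so all n decisions can be read off at once.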

best regards

MrFlick
  • most hypothesis tests in statsmodels are vectorized for this case. It should work columnwise if the data has observations in rows and multiple columns. – Josef Jun 09 '22 at 12:50
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – another victim of the mouse Jun 09 '22 at 15:29
  • @anothervictimofthemouse I believe the question is quite self-explanatory; at this stage I don't want to provide too many details due to the nature of the project at hand – user14038152 Jun 12 '22 at 09:07

1 Answer


First, the one-sample hypothesis test is vectorized. Here I assume the value under the null is 0:

import numpy as np
from statsmodels.stats.weightstats import ztest

x = np.random.randn(100, 4)

ztest_Score, p_value = ztest(x, value=0, alternative='two-sided')
ztest_Score, p_value
(array([1.69925429, 0.5359994 , 0.05777533, 0.78699997]),
 array([0.08927128, 0.59195896, 0.95392759, 0.43128188]))

[ztest(x[:, i], value=0 , alternative='two-sided') for i in range(x.shape[1])]
[(1.699254292717283, 0.0892712806133958),
 (0.5359994032597257, 0.5919589628688362),
 (0.057775326408478586, 0.953927592014832),
 (0.7869999680163862, 0.43128188488265284)]

Second, the two-sample test is vectorized with appropriate numpy broadcasting. The following compares each column of the first sample to the second sample y:

y = np.random.randn(100)
statistic, p_value = ztest(x, y, alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
 array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))

[ztest(x[:, i], y, alternative='two-sided') for i in range(x.shape[1])]
[(1.364454734896, 0.17242449122265047),
 (0.5062244362943313, 0.6126991023616855),
 (0.15362676881725684, 0.8779040290306083),
 (0.6474168385742498, 0.5173622008385331)]

statistic, p_value = ztest(x, y[:, None], alternative='two-sided')
statistic, p_value
(array([1.36445473, 0.50622444, 0.15362677, 0.64741684]),
 array([0.17242449, 0.6126991 , 0.87790403, 0.5173622 ]))

For the case in the question:

The two-sample case cannot have a single observation in one of the samples: ztest needs to estimate the variance of each sample to compute inferential statistics like p-values. Specifically, the ztest (or ttest) needs the standard error of the mean estimate for both samples, which depends on the sample sizes. If a sample has only a single observation, the pooled variance is used, but the standard error of that mean will be very large.

So, the options are to use either the one-sample z-test, which assumes that the second "mean" has no uncertainty, or the two-sample test with the full data series as the second sample, which computes the standard error of its mean from the sample.
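The two options can be sketched side by side on hypothetical data, where `subset` stands for the returns on the selected trading days and `full` for the full 5-year series (both names are illustrative):

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(1)
full = rng.normal(loc=0.0005, scale=0.01, size=1250)  # ~5 years of daily returns
subset = full[::5]                                    # e.g. one weekday's returns

# Option 1: one-sample test, treating the long-run mean as a known constant
stat1, p1 = ztest(subset, value=full.mean(), alternative="two-sided")

# Option 2: two-sample test, letting ztest estimate the uncertainty
# of both sample means from the data
stat2, p2 = ztest(subset, full, alternative="two-sided")

print(stat1, p1)
print(stat2, p2)
```

Note this sketch reuses `full` as the second sample even though `subset` is drawn from it, so the two samples are not independent; with real data one would typically compare the selected days against the remaining days or accept the one-sample approximation.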

Josef
  • thanks for the elaborate answer, but what test do you use when faced with data where the median is far from the mean? These tend to be datasets coming from skewed families like the Student-t or Pareto distributions. Basically, I want to test whether the difference in **medians**, or in two corresponding **quantiles**, is statistically significant - a case in point: the median and mean return on many days are actually different, at least mathematically. Your input is highly welcomed – user14038152 Jun 12 '22 at 11:06
  • That's a different question than vectorizing a t-test. There are other tests for distributions far from normal or symmetric, e.g. transforming the data, using a nonparametric test like the Brunner-Munzel rank test, or using trimmed means. – Josef Jun 12 '22 at 12:16
  • I see your point; I have checked the test you suggested, which I have never used before. On a different note, something related to my original question: in case you have applied the **groupby** property in pandas to find out more about your data and how sub-groups differ in their behavior, however due to the nature of the problem, my – user14038152 Jun 14 '22 at 08:56
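For completeness, the Brunner-Munzel rank test mentioned in the comments is available in scipy; a minimal sketch on hypothetical heavy-tailed data:

```python
import numpy as np
from scipy.stats import brunnermunzel

rng = np.random.default_rng(2)
# Illustrative skewed samples: Pareto-distributed "returns", one shifted
x = rng.pareto(3.0, size=200)
y = rng.pareto(3.0, size=200) + 0.2

# Nonparametric test of stochastic equality; robust to skew and heavy tails
stat, p = brunnermunzel(x, y)
print(stat, p)
```

Like the z-test above, it could be wrapped in a comprehension over dataframe columns to run many such tests at once.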