
I was wondering if I could use hypothesis testing on the training and testing data after splitting my dataset.

My objective is to check whether the two sample groups are balanced and similarly distributed, and so will provide a good environment for the ML model to be applied.

If so, should I expect H0 (the null hypothesis) to be accepted, i.e., I hope the testing data is a "microcosm" of the training data?

Or

Should I expect H1 (the alternative hypothesis) to be accepted, i.e., for the sake of checking the "foundations" of my ML environment, should I expect to find differences between the two samples?

Assuming my samples have more than 1000 data points each, follow a Gaussian distribution, and are independent, would a Z-test be a good strategy?

– jaymzleutz

1 Answer


Yes, you can run a hypothesis test to essentially "validate" that the test and train data come from the same distribution. To do so, you could set up a hypothesis test with:

H_0: Train and test data come from the same distribution
H_1: Train and test data do not come from the same distribution

You don't necessarily need to make assumptions about the shape of the data (e.g. that it comes from a Gaussian distribution); just pick a test appropriate for the type of data you're dealing with (categorical, continuous numeric, discrete numeric, etc.). For example, you could apply the Kolmogorov–Smirnov test or the Kruskal–Wallis test (both are implemented in scipy.stats, e.g. scipy.stats.kstest). I wouldn't recommend the Z-test (or the t-test, in fact), as it's usually used to compare whether the means of two samples are the same, not whether the samples come from the same distribution.
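For illustration, here is a minimal sketch of such a comparison using `scipy.stats.ks_2samp` (the two-sample form of the KS test); the arrays are synthetic placeholders standing in for one numeric column of your actual train/test split:

```python
import numpy as np
from scipy import stats

# Synthetic placeholders for one feature from the train and test splits.
rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, scale=1.0, size=800)
test_col = rng.normal(loc=0.0, scale=1.0, size=200)

# H0: both samples come from the same distribution.
statistic, p_value = stats.ks_2samp(train_col, test_col)
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")

# At the 5% significance level, a small p-value means we reject H0.
if p_value < 0.05:
    print("Reject H0: the train/test distributions look different.")
else:
    print("Fail to reject H0: no evidence the distributions differ.")
```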

It should be noted that although you mention test and train data as if you're comparing them on a single dimension, if you have multiple features/columns, the train and test samples of each column should be compared separately. As a real-life example, a subset of students selected "presumably randomly" from a school could have the same heights (or come from "the same distribution of heights") as the rest of the students, but completely different grades.
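As a rough sketch of that per-column comparison (the DataFrame names `train_df` and `test_df` are hypothetical, assumed to hold the same numeric columns):

```python
import pandas as pd
from scipy import stats

def compare_columns(train_df: pd.DataFrame,
                    test_df: pd.DataFrame,
                    alpha: float = 0.05) -> pd.DataFrame:
    """Run a two-sample KS test per column and flag where H0 is rejected."""
    rows = []
    for col in train_df.columns:
        statistic, p_value = stats.ks_2samp(train_df[col], test_df[col])
        rows.append({"column": col,
                     "ks_statistic": statistic,
                     "p_value": p_value,
                     "reject_H0": p_value < alpha})
    return pd.DataFrame(rows)
```

Each row of the result then tells you, feature by feature, whether the test would reject H0 at the chosen significance level.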

Finally, just to note that in formal hypothesis-testing language you cannot "accept" a null hypothesis, only "fail to reject" it (see here on Cross Validated).

– tania
  • Thank you for such nice support, @tania! If my model is supposed to predict one single target variable, would it be wrong to run the tests against this specific variable in the train data and in the test data as well? – jaymzleutz Nov 03 '20 at 17:58
  • Ah, I see. I imagined you meant the X variables. Yes, you can run a test that compares the distribution of the Y variable in the exact same way. – tania Nov 03 '20 at 18:00
  • I will accept your kind and objective explanation as sufficient. Would you please allow me to ask just one more thing? In my hypothesis test, can I make any other assumptions about H0 and H1 on the train and test sets, or only the one about the "same distribution"? Thank you again. – jaymzleutz Nov 03 '20 at 18:05
  • @jaymzleutz you can make hypotheses about any other statistic of `train` and `test` you want and design a test accordingly. In A/B testing, for example, we usually just compare whether the _mean_ of two samples is the same or not, but that might not be enough if you have very skewed distributions. You can make a hypothesis about the standard deviation being the same, the median being the same, etc. In fact, the non-parametric tests I mentioned take "the same distribution" to mean the percentiles or ranks are similar. I suggest you look up "prior probability shift" or "label shift" for ideas (a sketch of testing such individual statistics follows below). – tania Nov 03 '20 at 18:21
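As a rough sketch of testing individual statistics rather than the whole distribution (the arrays `train_y` and `test_y` are placeholders for the target variable in each split), scipy.stats also ships, e.g., Levene's test for equal variances and Mood's median test for equal medians:

```python
import numpy as np
from scipy import stats

# Placeholder skewed targets for the train and test splits.
rng = np.random.default_rng(1)
train_y = rng.exponential(scale=2.0, size=800)
test_y = rng.exponential(scale=2.0, size=200)

# H0: the two samples have equal variances (Levene's test).
stat_var, p_var = stats.levene(train_y, test_y)

# H0: the two samples have equal medians (Mood's median test).
stat_med, p_med, grand_median, table = stats.median_test(train_y, test_y)

print(f"Levene p-value (equal variances): {p_var:.4f}")
print(f"Median-test p-value (equal medians): {p_med:.4f}")
```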