Yes, you can run a hypothesis test to essentially "validate" that the test and train data come from the same distribution. To do so, you could set up a hypothesis test with:
H_0: Train and test data come from the same distribution
H_1: Train and test data do not come from the same distribution
You don't necessarily need to make assumptions about the shape of the data (e.g. that it comes from a Gaussian distribution); just pick a test appropriate for the type of data you're dealing with (categorical, continuous numeric, discrete numeric, etc.). For example, you could apply the Kolmogorov–Smirnov test or the Kruskal–Wallis test (both are implemented in scipy.stats, e.g. scipy.stats.ks_2samp for the two-sample Kolmogorov–Smirnov test). I wouldn't recommend the Z-test (or the t-test, in fact), as it's usually used to compare whether the means of two samples are the same, not whether they come from the same distribution.
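As a minimal sketch (using synthetic data, since the actual train/test sets aren't given), the two-sample Kolmogorov–Smirnov test looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic "train" and "test" samples drawn from the same distribution
train = rng.normal(loc=0.0, scale=1.0, size=1000)
test = rng.normal(loc=0.0, scale=1.0, size=300)

# Two-sample Kolmogorov–Smirnov test
# H_0: both samples come from the same distribution
stat, p_value = stats.ks_2samp(train, test)

if p_value < 0.05:
    print("Reject H_0: the distributions appear to differ")
else:
    print("Fail to reject H_0")
```

A small p-value would lead you to reject H_0; otherwise you fail to reject it (which, as noted below, is not the same as proving the distributions are equal).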
It should be noted that although you mention test and train data as if you're comparing them on a single dimension, if you have multiple features/columns, each feature should be compared separately between the two sets. As a real-life example, a subset of students selected "presumably randomly" from a school could have the same height (or come from "the same distribution of heights") as the rest of the students, but completely different grades.
Finally, just to note that in formal hypothesis-testing language you cannot "accept" a null hypothesis, only "fail to reject" it (see here on Cross Validated).