I have a large text dataset with two columns: the first is a text description and the second is the category it belongs to. I draw a stratified sample using the following method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
But I need proof that the sample represents the original population. How can I prove or verify that?
I have found that the chi-squared test is used for categorical data, but I am unable to find how to apply it to text data. Another method I have found is PCA, but how can PCA be plotted for text data?
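To make the chi-squared idea concrete, here is a minimal sketch of a goodness-of-fit statistic computed on the category column only, not on the text itself. It treats the population category proportions as the expected distribution for the sample. The function name `chi_square_statistic` and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def chi_square_statistic(population_labels, sample_labels):
    """Goodness-of-fit statistic of sample category counts against
    the counts expected from the population proportions."""
    pop_counts = Counter(population_labels)
    sample_counts = Counter(sample_labels)
    n_pop = len(population_labels)
    n_sample = len(sample_labels)

    stat = 0.0
    for category, pop_count in pop_counts.items():
        # Expected sample count if the sample mirrored the population.
        expected = n_sample * pop_count / n_pop
        observed = sample_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected

    # Degrees of freedom: number of categories minus one.
    dof = len(pop_counts) - 1
    return stat, dof


# Toy labels (hypothetical data, not from the real dataset).
population = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
sample = ["a"] * 15 + ["b"] * 8 + ["c"] * 2

stat, dof = chi_square_statistic(population, sample)
print(stat, dof)
```

A small statistic relative to the degrees of freedom means the category mix of the sample is close to that of the population. If SciPy is available, `scipy.stats.chi2.sf(stat, dof)` should turn the statistic into a p-value. Because `stratify=y` already forces the category proportions to match, this check says nothing about whether the text content is representative.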
Can anyone tell me how to compare the sample against the population to confirm that it is representative, using either statistical tests or any other method?
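On the PCA idea, one possible approach is to turn each description into a TF-IDF vector and project it to two dimensions. The sketch below swaps in scikit-learn's `TruncatedSVD` for PCA proper, because it works directly on sparse TF-IDF matrices. The helper name `project_texts` and the toy descriptions are hypothetical, and this is an assumption about how such a comparison could be set up, not an established procedure.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def project_texts(population_texts, sample_texts, n_components=2, seed=0):
    """Fit TF-IDF and a truncated SVD on the population, then project
    both the population and the sample into the same low-dim space."""
    vectorizer = TfidfVectorizer()
    pop_tfidf = vectorizer.fit_transform(population_texts)

    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    pop_low = svd.fit_transform(pop_tfidf)

    # Reuse the fitted vectorizer and SVD so both sets share one space.
    sample_low = svd.transform(vectorizer.transform(sample_texts))
    return pop_low, sample_low


# Toy descriptions (hypothetical data for illustration only).
population_texts = [
    "red cotton shirt with short sleeves",
    "blue denim jeans slim fit",
    "wireless mouse with usb receiver",
    "mechanical keyboard with backlight",
    "green wool sweater for winter",
    "laptop stand made of aluminium",
]
sample_texts = population_texts[::2]

pop_low, sample_low = project_texts(population_texts, sample_texts)
print(pop_low.shape, sample_low.shape)
```

Plotting `pop_low` and `sample_low` on the same scatter plot gives a visual check: a representative sample should cover roughly the same regions as the population. Each component could also be compared numerically, for example with a two-sample test per component.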