
I have a large text dataset with two columns: the first is a text description and the second is the category it belongs to. I draw a stratified sample using the following method:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

But I need proof that the sample represents the original population. How can I prove or ensure that?

I have found that the chi-squared (Chi2) test is used for categorical data, but I cannot find how to apply it to text data. Another method I have found is PCA, but how can PCA be applied to text data?

Can anyone tell me how to compare the sample against the population to ensure that it is representative, either with statistical tests or with any other method?

Narendra

1 Answer


You will have to train a classifier on the sample versus the entire population and make sure it cannot tell which rows came from your sample and which came from the full data.

Create a new dataset. Label the rows that were selected for the sample as class "Sample" and the remaining rows as class "Regular". Then train a classifier, for example a decision tree, with cross-validation, and check that it performs no better than chance: accuracy around 50% if the two classes are balanced, or an ROC AUC around 0.5 if they are not (with a 25% sample, always predicting "Regular" already gives about 75% accuracy). This means that the classifier cannot distinguish between the sample and the rest of the data.

If the classifier can distinguish between them, your sample does not truly represent the entire data. In that case, increase the number of rows in the sample and repeat until the model can no longer distinguish between the sample and the full data.
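
The check described above could be sketched roughly as follows. This is only an illustration under several assumptions: the helper name `sample_vs_rest_auc`, the TF-IDF features, the tree depth, and the tiny synthetic corpus are all made up for the example, and ROC AUC is used instead of accuracy because the "Sample" and "Regular" classes are not balanced with a 25% sample.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def sample_vs_rest_auc(texts, in_sample, n_splits=5, seed=0):
    """Mean cross-validated ROC AUC for predicting sample membership.

    texts: list of str; in_sample: 0/1 array (1 = row belongs to the sample).
    Hypothetical helper, not part of scikit-learn.
    """
    model = make_pipeline(
        TfidfVectorizer(),  # assumed text representation
        DecisionTreeClassifier(max_depth=5, random_state=seed),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, texts, in_sample, cv=cv, scoring="roc_auc")
    return scores.mean()


# Synthetic two-category corpus, purely for demonstration.
rng = np.random.default_rng(0)
words = {
    "sports": "goal match team score coach league player season stadium referee".split(),
    "tech": "code server laptop network software chip cloud database browser compiler".split(),
}
labels = np.array(["sports", "tech"] * 200)
texts = [" ".join(rng.choice(words[c], size=8)) for c in labels]

# Stratified 25% sample, drawn the same way as in the question.
idx = np.arange(len(texts))
_, sample_idx = train_test_split(idx, stratify=labels, test_size=0.25, random_state=0)
in_sample = np.zeros(len(texts), dtype=int)
in_sample[sample_idx] = 1
auc_random = sample_vs_rest_auc(texts, in_sample)

# Deliberately biased "sample" (only one category) for contrast.
biased = (labels == "sports").astype(int)
auc_biased = sample_vs_rest_auc(texts, biased)

print(f"stratified sample AUC: {auc_random:.2f}")
print(f"biased sample AUC:     {auc_biased:.2f}")
```

An AUC close to 0.5 suggests the classifier cannot tell the sample from the rest, while a value well above 0.5 (as for the biased sample here) suggests the sample differs systematically. On real data the score will fluctuate from run to run, so it is worth repeating the check with several seeds before drawing a conclusion.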

Roee Anuar