Train/test split preserving class proportions in each split

Question

Suppose I wanted to split my NER dataset that looks like this:

Data: "Jokowi is the president of Indonesia"
Label: ['B-Person', 'O', 'O', 'O', 'O', 'B-Country']

Is there any Python library or algorithm that ensures the class distribution is the same in the train and test sets? Any suggestions would be appreciated.

Can you add more information about the dataset (Maybe a link) and some explanation about the data schema? — mpSchrader, Oct 14 '20 at 12:48
The data looks something like this: `https://raw.githubusercontent.com/rifoag/absa-coextraction/master/dataset/train_4k.txt` (you can ignore the third column). Each item is a sentence from a hotel review. I want to split by sentence and keep the labels stratified, but the problem is that the labels are at the token level. Thank you for asking. – Rifo Genadi, Oct 15 '20 at 04:29

yatu · Answer 1 · 2020-10-14T13:13:41.213

3

sklearn's StratifiedShuffleSplit does exactly that. From the docs:

The folds are made by preserving the percentage of samples for each class.

The split method of StratifiedShuffleSplit returns a generator that yields pairs of index arrays for dividing your data into train and test sets. Here's a sample use case showing that the class proportions are indeed preserved in each split:

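import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 1200 samples, binary target with an 80/20 class imbalance
X = np.random.randint(0, 5, (1200, 2))
y = np.random.choice([0, 1], size=(1200,), p=[0.8, 0.2])

# Take the first of the stratified splits (80% train, 20% test)
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))

# Plot class counts per split; the 80/20 ratio should hold in both
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for split, title, ax in zip([train_index, test_index],
                            ['Train split', 'Test split'],
                            axes.flatten()):
    sns.countplot(x=y[split], ax=ax).set_title(title)
plt.show()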
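
The same check can be done numerically instead of with plots. This is a minimal sketch on the same kind of toy data; the helper `class_share` is a hypothetical name introduced here for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumption: same shape and 80/20 imbalance as the example above)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1200, 2))
y = rng.choice([0, 1], size=1200, p=[0.8, 0.2])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))


def class_share(labels):
    """Fraction of samples belonging to class 0 and class 1 (illustrative helper)."""
    return np.bincount(labels, minlength=2) / len(labels)


train_share = class_share(y[train_index])
test_share = class_share(y[test_index])

print("overall:", class_share(y))
print("train:  ", train_share)
print("test:   ", test_share)
```

The three printed rows should agree closely, since stratification allocates each class to the two splits in proportion to its overall frequency.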

edited Oct 14 '20 at 13:13

answered Oct 14 '20 at 12:55

yatu


Sorry, I forgot to mention that my intention is to stratify each individual label. For example, if the B-ASPECT label makes up 15% of the train set, its proportion in the test set should be similar. I am not sure sklearn's StratifiedShuffleSplit can do that, but thank you for the answer; I will investigate it. – Rifo Genadi Oct 15 '20 at 04:37
Have a look at the histograms in the answer; `StratifiedShuffleSplit` is doing exactly that. The proportions of values in the train and test sets are the same @RifoGenadi – yatu Oct 15 '20 at 06:39
My concern is that the data consists of sentences (["Jokowi is the President of Indonesia", "I'm with Michael here", ...]) while the labels are at the token level, so y is a list of label sequences ([['B-Person', 'O', 'O', 'O', 'O', 'B-Country'], ['O', 'O', 'B-Person', 'O']]). If I do exactly that, every unique sequence will be treated as its own class, so it doesn't work here. – Rifo Genadi Oct 16 '20 at 02:20
Well, this seems considerably more complicated then. You also didn't mention that you have a multioutput problem. Could you ask a new question for that, @RifoGenadi? Perhaps with a more complete example and a clearer explanation. – yatu Oct 16 '20 at 06:13
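
For the token-level case raised in the comments above, one possible approach (not something proposed in this thread) is to split at the sentence level many times at random and keep the split whose token-label distributions in train and test are closest. The sketch below uses only the standard library; the function names `label_distribution` and `best_token_stratified_split`, the `n_trials` parameter, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def label_distribution(label_seqs, labels):
    """Share of each token label across a list of label sequences."""
    counts = Counter(tag for seq in label_seqs for tag in seq)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[tag] / total for tag in labels]


def best_token_stratified_split(label_seqs, test_size=0.2, n_trials=200, seed=0):
    """Random search over sentence-level splits (illustrative, not exhaustive).

    Returns (gap, train_idx, test_idx), where gap is the summed absolute
    difference between train and test token-label shares.
    """
    rng = random.Random(seed)
    labels = sorted({tag for seq in label_seqs for tag in seq})
    indices = list(range(len(label_seqs)))
    n_test = max(1, round(test_size * len(indices)))

    best = None
    for _ in range(n_trials):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train_dist = label_distribution([label_seqs[i] for i in train_idx], labels)
        test_dist = label_distribution([label_seqs[i] for i in test_idx], labels)
        gap = sum(abs(a - b) for a, b in zip(train_dist, test_dist))
        if best is None or gap < best[0]:
            best = (gap, sorted(train_idx), sorted(test_idx))
    return best


# Toy label sequences modelled on the examples in the comments
y = [
    ['B-Person', 'O', 'O', 'O', 'O', 'B-Country'],
    ['O', 'O', 'B-Person', 'O'],
] * 10

gap, train_idx, test_idx = best_token_stratified_split(y)
print(len(train_idx), len(test_idx), round(gap, 4))
```

Sentences stay intact, so the match is only approximate; rare labels may still end up unevenly distributed, and more trials or a smarter search would be needed on real data.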

score 0 · Answer 2 · answered Oct 14 '20 at 12:49

0

You can explore StratifiedShuffleSplit, available in the scikit-learn library.

answered Oct 14 '20 at 12:49

Praks

