1

Suppose I wanted to split my NER dataset that looks like this:

Data: "Jokowi is the president of Indonesia"
Label: ['B-Person', 'O', 'O', 'O', 'O', 'B-Country']

Is there a Python library or algorithm that makes sure the class distribution is the same in the train and test datasets? Any suggestions would be appreciated.

yatu
Rifo Genadi
  • Can you add more information about the dataset (maybe a link) and some explanation of the data schema? – mpSchrader Oct 14 '20 at 12:48
  • The data looks something like this: `https://raw.githubusercontent.com/rifoag/absa-coextraction/master/dataset/train_4k.txt`, but you can ignore the third column. Each item is a sentence from a hotel review. I want to split by sentence and keep the labels stratified, but the problem is that the labels are at the token level. Thank you for asking – Rifo Genadi Oct 15 '20 at 04:29

2 Answers

3

You have sklearn's StratifiedShuffleSplit to do exactly that. From the docs:

The folds are made by preserving the percentage of samples for each class.

The split method of StratifiedShuffleSplit returns a generator that yields pairs of index arrays for splitting your data into train and test sets. Here's a sample use case showing that the class proportions are indeed preserved in each split:

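import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: two integer features and an imbalanced binary target (about 80% / 20%)
X = np.random.randint(0, 5, (1200, 2))
y = np.random.choice([0, 1], size=(1200,), p=[0.8, 0.2])

# Take the first of the stratified train/test index pairs
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))

# Plot the class counts of each split side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for split, title, ax in zip([train_index, test_index],
                            ['Train split', 'Test split'],
                            axes.flatten()):
    sns.countplot(x=y[split], ax=ax).set_title(title)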

[Image: count plots of the class labels in the train split and the test split]
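The same check can be done numerically instead of visually. This is only a minimal sketch on the same kind of toy data; `class_proportions` is a hypothetical helper introduced here for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data with a fixed seed so the run is reproducible
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1200, 2))
y = rng.choice([0, 1], size=1200, p=[0.8, 0.2])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))


def class_proportions(labels):
    """Hypothetical helper: fraction of samples belonging to each class (0 and 1)."""
    return np.bincount(labels, minlength=2) / len(labels)


train_prop = class_proportions(y[train_index])
test_prop = class_proportions(y[test_index])
print("train:", train_prop)
print("test: ", test_prop)
```

The two printed arrays should be nearly identical, which is what the count plots show graphically.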

yatu
  • I am sorry, I forgot to mention that my intention is to stratify each individual label. So if, say, the B-ASPECT label proportion in the train set is 15%, the proportion of B-ASPECT in the test set should be similar. I am not sure sklearn's `StratifiedShuffleSplit` can do that, but thank you for the answer; I will investigate it. – Rifo Genadi Oct 15 '20 at 04:37
  • Have a look at the histograms in the answer; `StratifiedShuffleSplit` is doing exactly that. The proportions of values in the train and test sets are the same @RifoGenadi – yatu Oct 15 '20 at 06:39
  • My concern is that the data is a list of sentences (["Jokowi is the President of Indonesia", "I'm with Michael here", ...]) and the labels are at the token level, so y is a list of label sequences ([['B-Person', 'O', 'O', 'O', 'O', 'B-Country'], ['O', 'O', 'B-Person', 'O']]). If I do exactly that, every unique sequence will be treated as a new class, so it doesn't work for this case. – Rifo Genadi Oct 16 '20 at 02:20
  • Well, this seems quite a bit more complicated then. Also, you didn't mention that you have a multioutput problem. Could you ask a new question for that, @RifoGenadi? Perhaps with a more complete example and a clearer explanation – yatu Oct 16 '20 at 06:13
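
For the token-level case raised in these comments, one possible workaround is to split by sentence with a greedy heuristic that tracks token-label counts. The sketch below is not from the answer and is not a scikit-learn feature; `split_by_token_labels` is a hypothetical function, and it assumes each sentence is given as a list of tag strings. Sentences are handled rarest tag first, and a sentence goes to the test set while the test set is still below its target count for that sentence's rarest tag.

```python
import random
from collections import Counter


def split_by_token_labels(label_seqs, test_size=0.2, seed=0):
    """Hypothetical greedy heuristic: split sentence indices into train/test so
    that each tag's token count in the test set is roughly test_size of its total."""
    rng = random.Random(seed)
    total = Counter(tag for seq in label_seqs for tag in seq)
    target = {tag: n * test_size for tag, n in total.items()}

    order = list(range(len(label_seqs)))
    rng.shuffle(order)
    # Stable sort: sentences with rare tags come first, shuffled order kept within ties
    order.sort(key=lambda i: min((total[t] for t in label_seqs[i]), default=0))

    test_counts = Counter()
    train_idx, test_idx = [], []
    for i in order:
        seq = label_seqs[i]
        rarest = min(seq, key=lambda t: total[t]) if seq else None
        if rarest is not None and test_counts[rarest] < target[rarest]:
            test_idx.append(i)
            test_counts.update(seq)
        else:
            train_idx.append(i)  # empty sentences also end up here
    return train_idx, test_idx


# Made-up toy labels, only to show the call
labels = [['B-Person', 'O', 'O']] * 20 + [['O', 'B-Country']] * 30 + [['O', 'O']] * 50
train_idx, test_idx = split_by_token_labels(labels, test_size=0.2)
print(Counter(tag for i in test_idx for tag in labels[i]))
```

Only the rarest tag of each sentence is controlled directly, so tags that always co-occur with a rarer one may drift from the target; treat the result as approximate and check the per-tag proportions afterwards.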
0

You can explore StratifiedShuffleSplit, which is available in the scikit-learn library.

Praks