How to write a configuration file to tell the AllenNLP trainer to randomly split a dataset into train and dev

Question

The official AllenNLP documentation suggests specifying "validation_data_path" in the configuration file, but what if one wants to construct a dataset from a single source and then randomly split it into train and validation datasets with a given ratio?

Does AllenNLP support this? I would greatly appreciate your comments.

score 1 · Accepted Answer · answered Mar 13 '21 at 01:26

AllenNLP does not have this functionality yet, but we are working on some stuff to get there.

In the meantime, here is how I did it for the VQAv2 reader: https://github.com/allenai/allennlp-models/blob/main/allennlp_models/vision/dataset_readers/vqav2.py#L354

This reader supports Python slicing syntax: for example, you can specify a data_path of "my_source_file[:1000]" to take the first 1000 instances from my_source_file. You can also supply multiple paths by setting data_path: ["file1", "file2[:1000]", "file3[1000:]"]. You can probably steal the top two blocks in that file (lines 354 to 369) and put them into your own dataset reader to achieve the same result.
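For illustration, here is a minimal sketch of how that kind of slicing could be wired into your own reader. It is not the code from the VQAv2 reader: the names parse_sliced_path and shuffled_slice, the regular expression, and the default seed are all assumptions made up for this example. Since plain slicing always takes the same instances in file order, the sketch also shuffles with a fixed seed before slicing, which is one way to get a random but reproducible train/dev split from a single source.

```python
import random
import re
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical pattern: "some/path[start:stop]", where start and stop are optional integers.
_SLICE_RE = re.compile(r"^(?P<path>.*)\[(?P<start>-?\d+)?:(?P<stop>-?\d+)?\]$")


def parse_sliced_path(spec: str) -> Tuple[str, slice]:
    """Split a spec such as 'file[:1000]' into the bare path and a slice object."""
    match = _SLICE_RE.match(spec)
    if match is None:
        # No slice suffix: use the whole file.
        return spec, slice(None)
    start = int(match["start"]) if match["start"] else None
    stop = int(match["stop"]) if match["stop"] else None
    return match["path"], slice(start, stop)


def shuffled_slice(instances: Sequence[T], piece: slice, seed: int = 13) -> List[T]:
    """Shuffle a copy with a fixed seed, then slice it.

    Using the same seed for every call keeps the order identical, so
    complementary slices such as [:8] and [8:] never overlap.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return shuffled[piece]


# Stand-in for the instances read from a single source file.
data = list(range(10))

train_path, train_slice = parse_sliced_path("my_source_file[:8]")
dev_path, dev_slice = parse_sliced_path("my_source_file[8:]")

train = shuffled_slice(data, train_slice)
dev = shuffled_slice(data, dev_slice)
```

With a reader built along these lines, the configuration file could point train_data_path at something like "my_source_file[:8000]" and validation_data_path at "my_source_file[8000:]", assuming your reader applies the parsed slice inside its _read method.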
