
I am following this tutorial here: https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb

So, using this code, I load my custom dataset:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=['/content/drive/MyDrive/mydata.csv'])

Then, I use this code to take a look at the dataset:

dataset
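This shows a DatasetDict with a single train split, something like this (the exact features depend on the CSV columns):

DatasetDict({
    train: Dataset({
        features: [...],
        num_rows: ...
    })
})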

Access an element:

dataset['train'][1]

Access a slice directly:

dataset['train'][:5]

After all of the above executes successfully, I try to run this:

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
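For context, batch_iterator comes from the tutorial; I am using it essentially as written there, which is roughly this (it slices the dataset in steps of 1000 and yields the "text" column):

batch_size = 1000

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]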

However, I get this error:

KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"

How do I fix this?

I am trying to train my own tokenizer, and this seems to be an issue.

Any help would be appreciated!

1 Answer


As the error trace says, you need to select a split. Without split=, load_dataset returns a DatasetDict keyed by split name, and an integer slice is not a valid key for a DatasetDict. Either specify the split when loading the dataset:

dataset = load_dataset('csv', data_files=['/content/drive/MyDrive/mydata.csv'], split='train')
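With split='train', load_dataset returns a single Dataset rather than a DatasetDict, so the tutorial's iterator then works unchanged (assuming your CSV has a "text" column, as in the tutorial):

batch_size = 1000  # batch size used in the tutorial

def batch_iterator():
    # dataset is a plain Dataset here, so integer slicing works
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]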

Or keep the DatasetDict and select the split inside batch_iterator:

def batch_iterator():
    # Index the 'train' split explicitly instead of the DatasetDict
    for i in range(0, len(dataset['train']), batch_size):
        yield dataset['train'][i : i + batch_size]["text"]
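Either way, the training call from the question should then run, for example (assuming the tutorial's gpt2 base tokenizer, which loads as a fast tokenizer and therefore supports train_new_from_iterator):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # base tokenizer from the tutorial
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)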