Train Tokenizer with HuggingFace dataset

Question

I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train a tokenizer with the following code:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

# Then training your tokenizer on a set of files just takes two lines of codes:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

# Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

However, the example loads from three files: wiki.train.raw, wiki.valid.raw and wiki.test.raw. In my case, I am loading from the wiki_split dataset. My code is as follows:

from tokenizers.trainers import BpeTrainer

def iterator_wiki(dataset):
    for txt in dataset:
        if type(txt) != float:
            yield txt

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(iterator_wiki(wiki_train), trainer=trainer)

tokenizer.train_from_iterator() only accepts one dataset split. How can I use the validation and test splits here?

score 1 · Accepted Answer · answered Feb 21 '23 at 10:29

Use a single iterator that walks over all three datasets one after another.

Also note that each element in the wiki_split dataset is a dictionary. The first element of the train dataset is shown below:

{'complex_sentence': "'' New Day '' is a song by American hip hop recording artist 50 Cent , released on July 27 , 2012 , as an promotional single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) .",
 'simple_sentence_1': "'' New Day '' is a song by American hip hop recording artist 50 Cent . ",
 'simple_sentence_2': " The song was released on July 27 , 2012 , as a single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) ."}
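
Because each record is a dictionary, the iterator has to pick out string fields before handing anything to the trainer. Here is a minimal sketch of that idea, assuming you want text from all three fields shown above rather than only complex_sentence. The helper name iter_texts and the tiny in-memory splits are hypothetical stand-ins used only for illustration; real dataset splits, which also yield dictionaries when iterated, should work the same way.

```python
from itertools import chain

# Field names taken from the record shown above.
TEXT_FIELDS = ("complex_sentence", "simple_sentence_1", "simple_sentence_2")

def iter_texts(*splits, fields=TEXT_FIELDS):
    """Yield every string found under `fields`, walking the splits in order."""
    for record in chain(*splits):
        for field in fields:
            value = record.get(field)
            # Skip missing fields and non-string values (e.g. None or NaN).
            if isinstance(value, str):
                yield value

# Tiny in-memory stand-ins for two dataset splits (not real wiki_split rows).
split_a = [{"complex_sentence": "A and B .",
            "simple_sentence_1": "A .",
            "simple_sentence_2": "B ."}]
split_b = [{"complex_sentence": "C .", "simple_sentence_1": None}]

texts = list(iter_texts(split_a, split_b))
print(texts)  # ['A and B .', 'A .', 'B .', 'C .']
```

A generator like this can be passed directly to tokenizer.train_from_iterator(...) in place of a single split.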

Working Example

# Load the datasets
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

train_dataset = load_dataset('wiki_split', split='train')
test_dataset = load_dataset('wiki_split', split='test')
val_dataset = load_dataset('wiki_split', split='validation')

# Iterator using the text from complex_sentence
def iterator_wiki(train_dataset, test_dataset, val_dataset):
    for mydataset in [train_dataset, test_dataset, val_dataset]:
        for data in mydataset:
            # Skip records whose complex_sentence is missing or not a string
            if isinstance(data.get("complex_sentence", None), str):
                yield data["complex_sentence"]

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(
    iterator_wiki(train_dataset, test_dataset, val_dataset), trainer=trainer)

output = tokenizer.encode("Hello, y'all! How are you  ?")
print(output.tokens)

Output:

['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '', '?']
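
As far as I know, train_from_iterator() also accepts an iterator that yields lists of strings, so the texts can be fed in batches instead of one at a time. Below is a rough sketch of such a batching generator. The name batch_iterator, the batch_size default and the toy rows are assumptions made for illustration, not part of the answer above.

```python
def batch_iterator(dataset, column="complex_sentence", batch_size=1000):
    """Yield lists of up to `batch_size` strings taken from `column`."""
    batch = []
    for record in dataset:
        text = record.get(column)
        # Keep only real strings, mirroring the isinstance check above.
        if isinstance(text, str):
            batch.append(text)
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Emit whatever is left over as a final, shorter batch.
    if batch:
        yield batch

# Toy rows standing in for a dataset split (hypothetical data).
rows = [{"complex_sentence": f"sentence {i}"} for i in range(5)]
batches = list(batch_iterator(rows, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

To cover all three splits, one option is to call this once per split and chain the results, for example with itertools.chain, before passing the combined iterator to tokenizer.train_from_iterator(..., trainer=trainer).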
