I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train a tokenizer with the following code:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE())
# You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
# Then training your tokenizer on a set of files just takes two lines of codes:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
# Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
However, that example loads from three files: wiki.train.raw, wiki.valid.raw, and wiki.test.raw. In my case, I am loading from the wiki_split dataset. My code is as follows:
from tokenizers.trainers import BpeTrainer

def iterator_wiki(dataset):
    # Yield each entry of the split, skipping float values
    for txt in dataset:
        if not isinstance(txt, float):
            yield txt

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(iterator_wiki(wiki_train), trainer=trainer)
As far as I can tell, tokenizer.train_from_iterator() only accepts a single iterator, so I can only pass one dataset split. How can I use the validation and test splits here as well?
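One direction I have been considering, as a rough sketch only: since train_from_iterator() takes a single iterator, the splits could be chained into one stream with itertools.chain before being passed in. The names iter_all_splits, wiki_valid, and wiki_test below are hypothetical, and the small lists are stand-ins for the real splits, which I assume are iterables of strings with occasional float entries.

```python
from itertools import chain

def iterator_wiki(dataset):
    # Yield each entry of one split, skipping float values
    for txt in dataset:
        if not isinstance(txt, float):
            yield txt

def iter_all_splits(*splits):
    # Hypothetical helper: lazily chain the filtered splits into one iterator
    return chain.from_iterable(iterator_wiki(split) for split in splits)

# Stand-in data (assumption): the real splits would come from wiki_split
wiki_train = ["first training sentence", float("nan")]
wiki_valid = ["a validation sentence"]
wiki_test = ["a test sentence"]

combined = list(iter_all_splits(wiki_train, wiki_valid, wiki_test))
print(combined)
```

With the real data, the call would presumably become tokenizer.train_from_iterator(iter_all_splits(wiki_train, wiki_valid, wiki_test), trainer=trainer), without the list() wrapper so the texts are streamed rather than held in memory.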