Following the example scripts, I am trying to train a tokenizer and a T5 model for Persian. I am using Google Colab Pro and run the following code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None  # setting this to 100_000 makes the training finish
# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
print("len dataset:", len(dataset))
# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]
# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)
# Save files to disk
tokenizer.save("/content/drive/MyDrive/Pouramini/tokenizer.json")
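Once training finishes and the JSON file is written, the idea is to load the tokenizer back for the model-training step. A minimal sketch of that check (wrapping it in PreTrainedTokenizerFast is my assumption about how the next step will consume it):

from transformers import PreTrainedTokenizerFast

# Only reachable if training actually completes: reload the saved tokenizer
# and make sure the special tokens survive the round trip.
loaded_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="/content/drive/MyDrive/Pouramini/tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)
print(loaded_tokenizer.tokenize("این یک جمله آزمایشی است."))  # a short Persian test sentence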
However, with input_sentence_size left as None the run never gets that far: it gets stuck inside train_from_iterator because the dataset is so large (around 8M sentences).
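Capping input_sentence_size at 100_000, as the comment above notes, does let training finish. A minimal sketch of that workaround (shuffling first so the sample is not just the beginning of the corpus is my own addition; the seed is arbitrary):

# Workaround that finishes: train on a random subsample instead of the full corpus.
sampled = dataset.shuffle(seed=42)

def sampled_batch_iterator(input_sentence_size=100_000, batch_length=100):
    for i in range(0, input_sentence_size, batch_length):
        yield sampled[i: i + batch_length]["text"]

tokenizer.train_from_iterator(
    iterator=sampled_batch_iterator(),
    vocab_size=vocab_size,
    show_progress=True,
)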
I would still like the tokenizer to be trained on the full corpus, though. How can I split the dataset into blocks, run the training on each block, and then merge the results into a single tokenizer?