Is the max_steps argument of TrainingArguments equal to num_rows_in_train / per_device_train_batch_size * num_train_epochs when using Hugging Face streaming datasets, where:
- num_rows_in_train is the total number of records in the training dataset
- per_device_train_batch_size is the batch size per device
- num_train_epochs is the number of epochs to run
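Plugging in illustrative numbers (the values below are placeholders, and the formula itself is the assumption being asked about), that would work out as:

```python
# Illustrative values only; the formula is the question's assumption, not confirmed behavior.
num_rows_in_train = 2048          # total records in the train split
per_device_train_batch_size = 1   # batch size per device
num_train_epochs = 3              # number of passes over the data

max_steps = num_rows_in_train // per_device_train_batch_size * num_train_epochs
print(max_steps)  # 6144
```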
As noted in "Streaming dataset into Trainer: does not implement len, max_steps has to be specified", training with a streaming dataset requires max_steps instead of num_train_epochs.
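For context, streaming mode returns an IterableDataset, which does not implement __len__, so the Trainer cannot derive the step count from the dataset size. A minimal check (the dataset name is just an example):

```python
from datasets import load_dataset

# Streaming returns an IterableDataset rather than a map-style Dataset.
stream = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)
print(type(stream).__name__)  # IterableDataset

# len(stream)  # would raise TypeError: IterableDataset has no __len__
```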
According to the documentation, max_steps is the total number of training steps, which should be the total number of mini-batches:

> If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.
For a small dataset with 2048 rows in the train split, I set the training arguments as below.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bloom_finetuned",
    max_steps=2048 * 3,      # 2048 rows * 3 epochs, with batch size 1 => 6144 steps
    num_train_epochs=3,      # should be overridden by max_steps
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
    no_cuda=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```
However, the training log shows a huge number of epochs:
```
***** Running training *****
Num examples = 6,144
Num Epochs = 9,223,372,036,854,775,807 <-----
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 6,144
Number of trainable parameters = 559,214,592
```
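For what it's worth, that epoch count equals 2**63 - 1, i.e. sys.maxsize on a 64-bit build, which looks like a sentinel value rather than a real epoch count:

```python
import sys

# 9,223,372,036,854,775,807 is the largest signed 64-bit integer.
print(sys.maxsize)               # 9223372036854775807 (on 64-bit Python)
print(sys.maxsize == 2**63 - 1)  # True
```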