Is the max_steps argument of TrainingArguments equal to num_rows_in_train / per_device_train_batch_size * num_train_epochs when using Hugging Face streaming datasets, where:
- num_rows_in_train is the total number of records in the training dataset
- per_device_train_batch_size is the batch size per device
- num_train_epochs is the number of epochs to run
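Plugging in illustrative numbers (the values below are placeholders, and the formula itself is the assumption being asked about), that would work out as:

```python
# Illustrative values only; the formula is the question's assumption, not confirmed behavior.
num_rows_in_train = 2048          # total records in the train split
per_device_train_batch_size = 1   # batch size per device
num_train_epochs = 3              # number of passes over the data

max_steps = num_rows_in_train // per_device_train_batch_size * num_train_epochs
print(max_steps)  # 6144
```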
As noted in "Streaming dataset into Trainer: does not implement len, max_steps has to be specified", training with a streaming dataset requires max_steps instead of num_train_epochs.
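For context, streaming mode returns an IterableDataset, which does not implement __len__, so the Trainer cannot derive the step count from the dataset size. A minimal check (the dataset name is just an example):

```python
from datasets import load_dataset

# Streaming returns an IterableDataset rather than a map-style Dataset.
stream = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)
print(type(stream).__name__)  # IterableDataset

# len(stream)  # would raise TypeError: IterableDataset has no __len__
```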
According to the documentation, max_steps is the total number of training steps, which should be the total number of mini-batches:

> If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.
For a small dataset with 2048 rows in the train split, I set the training arguments as below.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bloom_finetuned",
    max_steps=2048 * 3,      # 2048 rows * 3 epochs, with batch size 1 => 6144 steps
    num_train_epochs=3,      # should be overridden by max_steps
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
    no_cuda=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```
However, the training log shows a huge number of epochs:
```
***** Running training *****
Num examples = 6,144
Num Epochs = 9,223,372,036,854,775,807 <-----
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 6,144
Number of trainable parameters = 559,214,592
```
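For what it's worth, that epoch count equals 2**63 - 1, i.e. sys.maxsize on a 64-bit build, which looks like a sentinel value rather than a real epoch count:

```python
import sys

# 9,223,372,036,854,775,807 is the largest signed 64-bit integer.
print(sys.maxsize)               # 9223372036854775807 (on 64-bit Python)
print(sys.maxsize == 2**63 - 1)  # True
```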