I tried running this code for hyper-parameter tuning of a BERT model. I only have one GPU, so I adapted the code to run only one training at a time. This is the resulting code:

from datasets import load_dataset, load_metric
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
dataset = load_dataset('glue', 'mrpc')
metric = load_metric('glue', 'mrpc')

def encode(examples):
    outputs = tokenizer(
        examples['sentence1'], examples['sentence2'], truncation=True)
    return outputs

encoded_dataset = dataset.map(encode, batched=True)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate during training and a bit more often
# than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
training_args = TrainingArguments(
    "test", evaluation_strategy="steps", eval_steps=500, disable_tqdm=True)
trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    model_init=model_init,
    compute_metrics=compute_metrics,
)

# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,  # number of trials
    max_concurrent_trials=1,
)

Note that I added the max_concurrent_trials=1 parameter to achieve this. However, I run into the following error:

ERROR serialization.py:354 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

This happens only on EVEN runs. So:

  1. Runs correctly
  2. Fails due to GPU not available
  3. Runs correctly
  4. Fails due to GPU not available

This is the status report printed by Ray Tune at that point:

Current time: 2022-10-05 16:33:11 (running for 00:01:30.88)
Memory usage on this node: 93.0/2003.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/255 CPUs, 1.0/4 GPUs, 0.0/1746.65 GiB heap, 0.0/186.26 GiB objects
Result logdir: /home/***/ray_results/_objective_2022-10-05_16-31-40
Number of trials: 5/10 (2 ERROR, 1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+
| Trial name             | status     | loc                 |   learning_rate |   num_train_epochs |   per_device_train_... |     seed |   objective |
|------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------|
| _objective_6d123_00004 | RUNNING    | ******************* |     2.3102e-06  |                  5 |                      8 | 25.0818  |             |
| _objective_6d123_00000 | TERMINATED | ******************* |     5.61152e-06 |                  5 |                     64 |  8.15396 |     1.52889 |
| _objective_6d123_00002 | TERMINATED | ******************* |     8.28892e-06 |                  5 |                     16 | 24.4435  |     1.73279 |
| _objective_6d123_00001 | ERROR      | ******************* |     1.56207e-05 |                  2 |                     16 |  7.08379 |             |
| _objective_6d123_00003 | ERROR      | ******************* |     1.09943e-06 |                  2 |                      8 | 29.158   |             |
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+

I think the second run fails because it starts before the first one has released the GPU. This would explain why the third run starts and terminates correctly, while the fourth, like the second, fails because the GPU is unavailable.

How can I solve this? Should I somehow put a timeout between runs, waiting for the GPU to become available before starting the next one? Is there any way to do this with Ray?
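
For reference, this is a rough sketch of the two workarounds I am considering; both rest on assumptions I have not verified. I assume that extra keyword arguments to hyperparameter_search are forwarded to ray.tune.run (so resources_per_trial would reserve the whole GPU for each trial), and that ray.tune.utils.wait_for_gpu can be called inside model_init, which runs at the start of every trial, to block until the previous trial's GPU memory has been released:

import torch
from ray.tune.utils import wait_for_gpu
from transformers import AutoModelForSequenceClassification

def model_init():
    # Sketch: block here until GPU memory utilisation has dropped, on the
    # assumption that model_init is called at the start of every trial.
    if torch.cuda.is_available():
        wait_for_gpu(target_util=0.05, retry=50, delay_s=5)
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)

# Alternative sketch: reserve one full GPU per trial so that Ray's own
# scheduler only starts a trial once the GPU is actually free
# (assumes these kwargs are forwarded to ray.tune.run).
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,
    resources_per_trial={"cpu": 1, "gpu": 1},
)

I don't know whether either of these is the intended way to handle this, so any guidance is appreciated.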
