I tried running this code for hyper-parameter tuning of a BERT model. I only have one GPU, so I adapted the code to run only one training at a time. This is the resulting code:

from datasets import load_dataset, load_metric
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
dataset = load_dataset('glue', 'mrpc')
metric = load_metric('glue', 'mrpc')

def encode(examples):
    outputs = tokenizer(
        examples['sentence1'], examples['sentence2'], truncation=True)
    return outputs

encoded_dataset = dataset.map(encode, batched=True)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate during training and a bit more often
# than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
training_args = TrainingArguments(
    "test", evaluation_strategy="steps", eval_steps=500, disable_tqdm=True)
trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    model_init=model_init,
    compute_metrics=compute_metrics,
)

# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,  # number of trials
    max_concurrent_trials=1,
)

Note that I added the max_concurrent_trials=1 parameter to achieve this. However, I run into the following error:

ERROR serialization.py:354 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

This happens only on EVEN runs. So:

  1. Runs correctly
  2. Fails due to GPU not available
  3. Runs correctly
  4. Fails due to GPU not available

This is the status report printed by Ray Tune at that point:

Current time: 2022-10-05 16:33:11 (running for 00:01:30.88)
Memory usage on this node: 93.0/2003.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/255 CPUs, 1.0/4 GPUs, 0.0/1746.65 GiB heap, 0.0/186.26 GiB objects
Result logdir: /home/***/ray_results/_objective_2022-10-05_16-31-40
Number of trials: 5/10 (2 ERROR, 1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+
| Trial name             | status     | loc                 |   learning_rate |   num_train_epochs |   per_device_train_... |     seed |   objective |
|------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------|
| _objective_6d123_00004 | RUNNING    | ******************* |     2.3102e-06  |                  5 |                      8 | 25.0818  |             |
| _objective_6d123_00000 | TERMINATED | ******************* |     5.61152e-06 |                  5 |                     64 |  8.15396 |     1.52889 |
| _objective_6d123_00002 | TERMINATED | ******************* |     8.28892e-06 |                  5 |                     16 | 24.4435  |     1.73279 |
| _objective_6d123_00001 | ERROR      | ******************* |     1.56207e-05 |                  2 |                     16 |  7.08379 |             |
| _objective_6d123_00003 | ERROR      | ******************* |     1.09943e-06 |                  2 |                      8 | 29.158   |             |
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+

I think the second run fails because it starts before the first one has released the GPU. This would explain why the third run starts and terminates correctly, while the fourth, like the second, fails because the GPU is unavailable.

How can I solve this? Should I somehow put a timeout between runs, waiting for the GPU to become available before starting the next one? Is there any way to do this with Ray?
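
For reference, this is a rough sketch of the two workarounds I am considering; both rest on assumptions I have not verified. I assume that extra keyword arguments to hyperparameter_search are forwarded to ray.tune.run (so resources_per_trial would reserve the whole GPU for each trial), and that ray.tune.utils.wait_for_gpu can be called inside model_init, which runs at the start of every trial, to block until the previous trial's GPU memory has been released:

import torch
from ray.tune.utils import wait_for_gpu
from transformers import AutoModelForSequenceClassification

def model_init():
    # Sketch: block here until GPU memory utilisation has dropped, on the
    # assumption that model_init is called at the start of every trial.
    if torch.cuda.is_available():
        wait_for_gpu(target_util=0.05, retry=50, delay_s=5)
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)

# Alternative sketch: reserve one full GPU per trial so that Ray's own
# scheduler only starts a trial once the GPU is actually free
# (assumes these kwargs are forwarded to ray.tune.run).
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,
    resources_per_trial={"cpu": 1, "gpu": 1},
)

I don't know whether either of these is the intended way to handle this, so any guidance is appreciated.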
