I tried running this code for hyper-parameter tuning of a BERT model. I only have one GPU, so I adapted the code to run only one trial at a time. This is the resulting code:
from datasets import load_dataset, load_metric
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
dataset = load_dataset('glue', 'mrpc')
metric = load_metric('glue', 'mrpc')

def encode(examples):
    outputs = tokenizer(
        examples['sentence1'], examples['sentence2'], truncation=True)
    return outputs

encoded_dataset = dataset.map(encode, batched=True)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate during training and a bit more often
# than the default to be able to prune bad trials early.
# Disabling tqdm is a matter of preference.
training_args = TrainingArguments(
    "test", evaluation_strategy="steps", eval_steps=500, disable_tqdm=True)

trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    model_init=model_init,
    compute_metrics=compute_metrics,
)

# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,  # number of trials
    max_concurrent_trials=1,
)
Note that I added the max_concurrent_trials=1 parameter to enforce running one trial at a time.
When I run it, I get the following error:
ERROR serialization.py:354 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
This happens only on the even-numbered runs, i.e.:
- Run 1: completes correctly
- Run 2: fails because the GPU is reported as unavailable
- Run 3: completes correctly
- Run 4: fails because the GPU is reported as unavailable
Current time: 2022-10-05 16:33:11 (running for 00:01:30.88)
Memory usage on this node: 93.0/2003.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/255 CPUs, 1.0/4 GPUs, 0.0/1746.65 GiB heap, 0.0/186.26 GiB objects
Result logdir: /home/***/ray_results/_objective_2022-10-05_16-31-40
Number of trials: 5/10 (2 ERROR, 1 RUNNING, 2 TERMINATED)
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+
| Trial name | status | loc | learning_rate | num_train_epochs | per_device_train_... | seed | objective |
|------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------|
| _objective_6d123_00004 | RUNNING | ******************* | 2.3102e-06 | 5 | 8 | 25.0818 | |
| _objective_6d123_00000 | TERMINATED | ******************* | 5.61152e-06 | 5 | 64 | 8.15396 | 1.52889 |
| _objective_6d123_00002 | TERMINATED | ******************* | 8.28892e-06 | 5 | 16 | 24.4435 | 1.73279 |
| _objective_6d123_00001 | ERROR | ******************* | 1.56207e-05 | 2 | 16 | 7.08379 | |
| _objective_6d123_00003 | ERROR | ******************* | 1.09943e-06 | 2 | 8 | 29.158 | |
+------------------------+------------+---------------------+-----------------+--------------------+------------------------+----------+-------------+
I think the second run fails because it starts before the first run has released the GPU. This would explain why the third run starts and terminates correctly, while the fourth (like the second) fails because the GPU is unavailable.
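To check this, I guess I could print what each trial process actually sees, e.g. something like this inside model_init (just a debugging idea, not something I have run yet):

import os
import torch

def model_init():
    # Debugging idea: show which GPUs this Ray trial process can see
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("torch.cuda.is_available() =", torch.cuda.is_available())
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)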
How can I solve this?
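One idea I had is to make Ray reserve the whole GPU for each trial, so that a new trial is only scheduled once the previous one has released it. I'm assuming here that extra keyword arguments to hyperparameter_search are forwarded to ray.tune.run and that resources_per_trial is the right knob; I haven't verified either:

trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,
    max_concurrent_trials=1,
    # Assumption: reserve one full GPU per trial so the next trial
    # cannot start until the previous one has given the GPU back.
    resources_per_trial={"cpu": 1, "gpu": 1},
)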
Should I somehow put a timeout between runs, to wait for the GPU to become available before starting the next one? Is there any way to do that with Ray?
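Something along these lines is what I have in mind. I believe Ray Tune has a wait_for_gpu utility (which I think requires GPUtil to be installed), but I'm not sure it applies to this situation, or that model_init is the right place to call it:

from ray.tune.utils import wait_for_gpu

def model_init():
    # Guess: block until GPU memory utilisation drops below ~5% before
    # this trial tries to put anything on the GPU.
    wait_for_gpu(target_util=0.05)
    return AutoModelForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', return_dict=True)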