
I have a question about SageMaker multi-GPU. IHAC (I have a customer) running their code on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), it runs into the following error:

“Failure reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.”

Their code handles multi-GPU usage and currently works well on their machines outside of AWS. Do you have any documentation you can point me to that would help them address the problem? They are currently using PyTorch for all of their model development.

  • Are they using a SageMaker hyperparameter tuning job? Also, do you know how they are running multi-GPU training? Is it through PyTorch DistributedDataParallel? You can refer to our examples here for running multi-GPU distributed training jobs: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb – Arun Lokanatha Sep 09 '22 at 17:12
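
For reference, a minimal sketch of how a multi-GPU PyTorch training job can be launched with the SageMaker Python SDK is shown below. The entry point, source directory, IAM role, framework version, and S3 path are placeholders rather than the customer's actual values, and the `pytorchddp` distribution option assumes a recent PyTorch framework version (1.12 or later):

```python
# Minimal sketch (not the customer's actual code): launching a multi-GPU
# PyTorch training job on SageMaker. All names below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # assumed training script name
    source_dir="src",                  # assumed source directory
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.8xlarge",     # 4 GPUs on this instance type
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    # Launch one worker process per GPU so PyTorch DistributedDataParallel
    # can initialize inside the training script.
    distribution={"pytorchddp": {"enabled": True}},
    hyperparameters={"epochs": 10, "lr": 1e-3},
)

estimator.fit({"training": "s3://my-bucket/train"})  # placeholder S3 input
```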

1 Answer


It looks like they are running Hyperparameter Optimization (HPO) on SageMaker, and their code is not emitting any metric that HPO can use to tune. The problem is most likely in how they specify the regular expression for the objective metric; for more details, see SageMaker Estimator Metric Definitions.

Essentially, use a tool like https://regex101.com to validate that the regex they use actually extracts the objective value from their training logs.
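
As an illustration, here is a rough sketch of wiring a metric definition into a tuning job with the SageMaker Python SDK. The metric name, regex, hyperparameter range, and all resource names are assumptions, not the customer's actual configuration; the key point is that the training script must print a line the regex can match:

```python
# Illustrative sketch only: metric name, regex, log format, and hyperparameter
# ranges are assumptions, not the customer's actual setup.
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

estimator = PyTorch(
    entry_point="train.py",            # assumed script; it must print the metric
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
)

# The training script must emit a log line this regex can capture, e.g.:
#     print(f"validation:accuracy={acc:.4f}")
metric_definitions = [
    {"Name": "validation:accuracy", "Regex": r"validation:accuracy=([0-9\.]+)"}
]

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    metric_definitions=metric_definitions,
    hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-2)},
    max_jobs=5,
    max_parallel_jobs=2,
)

tuner.fit({"training": "s3://my-bucket/train"})  # placeholder S3 input
```

One thing worth checking in the multi-GPU case: if the script was changed to use DistributedDataParallel, the metric line may now only be printed by rank 0, or the log format may have changed, so a regex that matched the single-GPU logs may no longer match anything in the multi-GPU job's logs.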

juvchan