I have a question about SageMaker multi-GPU training. I have a customer whose code runs fine on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), it fails with the following error:
“Failure reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.”
Their code handles multi-GPU usage and currently works well on their own machine outside of AWS. Is there any documentation you can point me to that would help them address the problem? They currently use PyTorch for all of their model development.
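
Since the error refers to the objective metric's regular expression, one thing that could be checked locally is whether that regex actually matches what the training script prints in each setup. Below is a minimal sketch of such a check. The regex and the log lines are hypothetical placeholders, not taken from the customer's job, and `find_metric` is an illustrative helper rather than part of any SageMaker API.

```python
import re

# Hypothetical metric regex, written in the style of a metric definition.
# The customer's real regex is not known here.
METRIC_REGEX = r"val_loss=([0-9\.]+)"

# Hypothetical log lines: one in the format the regex expects, and one
# in a different format (for example, if a multi-GPU code path logs differently).
SAMPLE_LOGS = [
    "epoch=1 train_loss=0.91 val_loss=0.84",
    "[rank 0] epoch=1 train_loss=0.91 validation loss: 0.84",
]


def find_metric(regex, line):
    """Return the captured metric as a float, or None if the regex does not match."""
    match = re.search(regex, line)
    return float(match.group(1)) if match else None


results = [find_metric(METRIC_REGEX, line) for line in SAMPLE_LOGS]

for line, value in zip(SAMPLE_LOGS, results):
    print(f"{value!r:>6}  <- {line}")
```

Running the same check against the actual log output of a failed ml.p3.8xlarge training job would show whether the metric line is missing or just formatted differently from the single-GPU runs.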