This error happens when the SDK tries to look up an eligible container image for your job and finds that (unlike other frameworks such as base PyTorch) HF only offers CUDA-enabled DLC images: there are no CPU variants to match a CPU-only local environment.
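You should be able to reproduce the lookup failure directly with `sagemaker.image_uris.retrieve` (a minimal sketch; the exact exception message may vary between SDK versions, and I'm assuming a CPU instance type here to trigger it):

```python
import sagemaker

# Requesting a CPU instance type should fail, because the Hugging Face
# DLC catalogue only lists GPU images for this framework/version combo:
sagemaker.image_uris.retrieve(
    framework="huggingface",
    region="us-east-1",
    instance_type="ml.c5.xlarge",  # CPU instance -> no matching HF image
    py_version="py38",
    version="4.17",
    base_framework_version="pytorch1.10",
    image_scope="training",
)  # raises a ValueError along the lines of "Unsupported processor: cpu"
```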
Maybe (I haven't checked, but would be interested to know) you could actually run the GPU image locally in Docker without issue? You could try explicitly passing the GPU image via the `image_uri` parameter of your Estimator and seeing whether it runs okay:
```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Look up the GPU training image by requesting a GPU instance type:
train_image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    region=your_region,  # e.g. "us-east-1"
    instance_type="ml.p3.2xlarge",  # GPU instance type -> GPU image
    py_version="py38",
    version="4.17",
    base_framework_version="pytorch1.10",
    image_scope="training",
)

# Override the automatic image lookup, but still run in local mode:
estimator = HuggingFace(
    image_uri=train_image_uri,
    instance_type="local",
    ...
)
```
(For the supported combinations of these parameters, you can refer to the `image_uri_config/huggingface.json` file in the SageMaker Python SDK repository.)
Alternatively, you could probably just use the PyTorch framework for your local development (or TensorFlow, if you're using HuggingFace TF), and include a `requirements.txt` file in your script bundle to install the HF libraries at the version(s) you need. For example:
```text
# requirements.txt in the same source_dir folder as your train.py script
transformers[sklearn,sentencepiece]==4.17.0
datasets==1.18.4
```
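A minimal sketch of that setup, assuming your entry point is `train.py` in a local `src/` folder (the SageMaker framework containers install `source_dir/requirements.txt` automatically before running your script; `role` is your execution role ARN):

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",  # folder containing train.py and requirements.txt
    role=role,
    framework_version="1.10",  # match the base PyTorch version of the HF DLC
    py_version="py38",
    instance_type="local",  # run in local Docker instead of a managed instance
    instance_count=1,
)
```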
This would make your local test environment slightly different from the true training job environment, but hopefully close enough to be useful for debugging initial functional issues in your code before running the actual training attempts on SageMaker.