I am setting up a custom container (BYOC) for distributed training with TensorFlow 2.x using the SageMaker distributed data parallel library (SMDDP), but I get the following runtime error when importing smdistributed.dataparallel.tensorflow:

RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.

Has anyone run into this error before? Is there a good GitHub example to follow?

Here are the Dockerfile and the requirements file.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04
ARG DEBIAN_FRONTEND=noninteractive

ENV PATH="/opt/ml/code:${PATH}"

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

WORKDIR /opt/ml/code
COPY requirements.txt ./
RUN pip3 install -r requirements.txt --no-cache --upgrade
COPY . ./

ENV SAGEMAKER_PROGRAM main.py

requirements.txt

- configparser==5.0.2
- pandas==1.1.5
- Pillow==8.1.0
- boto3==1.17.3
- tqdm==4.62.3
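
To help narrow down the RuntimeError above, a small diagnostic along these lines could be run at the top of the entry point. It is only a sketch and rests on an assumption: that smddprun is started through Open MPI's mpirun, so a process launched correctly would see the standard Open MPI variables OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE. The helper names launch_diagnostics and try_import_smddp are hypothetical and not part of any SageMaker API.

```python
import os


def launch_diagnostics(environ=None):
    """Report whether this process looks like it was started by an MPI launcher.

    Assumption: smddprun runs under Open MPI's mpirun, which exports the
    OMPI_COMM_WORLD_* variables to every rank it starts.
    """
    env = os.environ if environ is None else environ
    return {
        "under_mpi": "OMPI_COMM_WORLD_RANK" in env,
        "mpi_rank": env.get("OMPI_COMM_WORLD_RANK"),
        "mpi_world_size": env.get("OMPI_COMM_WORLD_SIZE"),
    }


def try_import_smddp():
    """Attempt the import from the question and capture the failure, if any."""
    try:
        import smdistributed.dataparallel.tensorflow as sdp
    except (ImportError, RuntimeError) as exc:
        # ImportError: library missing from the image.
        # RuntimeError: the "outside smddprun" error quoted in the question.
        return None, f"{type(exc).__name__}: {exc}"
    return sdp, None


if __name__ == "__main__":
    print(launch_diagnostics())
    module, error = try_import_smddp()
    print("import ok" if module is not None else f"import failed -> {error}")
```

If under_mpi comes back False inside the training job, that would suggest the script was started as a plain Python process rather than through the distributed launcher, which would be consistent with the error message.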