I am setting up a custom container (BYOC) for distributed training in TensorFlow 2.x using the SageMaker Distributed Data Parallel Library (SMDDP), but I get the following runtime error when importing smdistributed.dataparallel.tensorflow:
RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
Has anyone run into this error before? Is there a good GitHub example to follow?
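To help narrow it down, here is a minimal sketch of how the failing import could be isolated. It assumes that a missing installation surfaces as an ImportError, while the launch problem surfaces as the RuntimeError quoted above. The helper name try_import_smddp is hypothetical and only for illustration; it is not part of the SMDDP library.

```python
import importlib


def try_import_smddp(module_name="smdistributed.dataparallel.tensorflow"):
    """Attempt the import and report how it failed, if it did.

    Hypothetical diagnostic helper (not part of SMDDP).
    Returns a (module_or_None, status_string) tuple.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # The package is not present in this environment at all.
        return None, f"not installed: {exc}"
    except RuntimeError as exc:
        # The package is present but refuses to load, e.g. because the
        # script was not started through the smddprun launcher.
        return None, f"launch error: {exc}"
    return module, "ok"


if __name__ == "__main__":
    _, status = try_import_smddp()
    print(status)
```

If this prints a launch error inside the container, the library itself is installed and the problem is likely in how the training script is started rather than in the image contents.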
Here are the Dockerfile and the requirements file.
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04
ARG DEBIAN_FRONTEND=noninteractive
ENV PATH="/opt/ml/code:${PATH}"
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
WORKDIR /opt/ml/code
COPY requirements.txt ./
RUN pip3 install -r requirements.txt --no-cache --upgrade
COPY . ./
ENV SAGEMAKER_PROGRAM main.py
requirements.txt:
- configparser==5.0.2
- pandas==1.1.5
- Pillow==8.1.0
- boto3==1.17.3
- tqdm==4.62.3