
I want to run an RL training job on AWS SageMaker (script given below). But since the project is complex, I was hoping to do a test run using SageMaker Local Mode (on my M1 MacBook Pro) before submitting to paid instances. However, I am struggling to get even a simple training task to run successfully in Local Mode.

Previously, I used tensorflow-macos and tensorflow-metal when running local training jobs (without SageMaker). But I did not see anywhere to specify these in framework_version, nor am I sure that 'local_gpu', which is the correct argument for a normal Linux machine with a GPU, is right for Apple Silicon (M1 Pro).

I searched all over but cannot find a case where this is addressed. (Very odd; am I doing something wrong? If so, please correct me.) If not, and anyone knows of a configuration, a Docker image, or an example that works on an M1 Pro, please share.

I tried to run the following code, which hangs after logging in. If you want to reproduce this, use any simple training script as the entry_point, and make sure to log in to ECR first using the AWS CLI with a command matching your region, for example:

    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

##main.py
import boto3
import sagemaker
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist
from sagemaker.tensorflow import TensorFlow

sess = sagemaker.Session() 
role = 'arn:aws:iam::0000000000000:role/CFN-SM-IM-Lambda-Catalog-sk-SageMakerExecutionRole-BlaBlaBla' # KINDLY ADD YOUR ROLE HERE

(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

os.makedirs("./data", exist_ok = True)

np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

# Train on local data. S3 URIs would work too.
training_input_path   = 'file://data/training.npz'
validation_input_path = 'file://data/validation.npz'

# Store model locally. An S3 URI would work too.
output_path           = 'file:///tmp/model/'

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          instance_count=1, 
                          instance_type='local_gpu',   # 'local' to train on the local CPU, 'local_gpu' if a GPU is available
                          framework_version='2.1.0',
                          py_version='py3',
                          hyperparameters={'epochs': 1},
                          output_path=output_path
                         )

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
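
For reference, here is a minimal sketch of the kind of entry point I am using (the helper names are illustrative; SageMaker exposes each fit() channel path via an SM_CHANNEL_<NAME> environment variable and passes hyperparameters as command-line flags):

```python
# mnist_keras_tf.py -- minimal sketch of the entry point (names illustrative).
# SageMaker mounts each fit() channel and exposes its path via SM_CHANNEL_<NAME>;
# hyperparameters arrive as command-line flags.
import argparse
import os

import numpy as np


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    return parser.parse_args(argv)


def load_channel(channel_dir, name):
    # Each channel directory contains the .npz file passed via the file:// URI.
    data = np.load(os.path.join(channel_dir, name + '.npz'))
    return data['image'], data['label']


# In the real script, the bottom of the file would do something like:
# if __name__ == '__main__':
#     args = parse_args()
#     x_train, y_train = load_channel(os.environ['SM_CHANNEL_TRAINING'], 'training')
#     x_val, y_val = load_channel(os.environ['SM_CHANNEL_VALIDATION'], 'validation')
#     # build a Keras model, fit for args.epochs, save under args.model_dir
```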
  • I don't think there's a 'local_gpu' option for instance type, you'd simply specify 'local'. Have you tried that, and does it work? – durga_sury Oct 17 '22 at 15:02
  • Hi @durga_sury, thank you for the comment. And yes, I did try that; it ends up saying there is no GPU found. I believe this is because 'local_gpu' was created to accommodate Nvidia GPUs. – spramuditha Oct 17 '22 at 22:50

1 Answer


The prebuilt SageMaker Docker images for deep learning don't have Arm-based support yet. You can see the available Deep Learning Containers images here.

The solution is to build your own Docker image and use it with SageMaker.

This is an example Dockerfile that uses miniconda to install TensorFlow dependencies:

FROM arm64v8/ubuntu

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         nginx \
         ca-certificates \
         gcc \
         linux-headers-generic \
         libc-dev

RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN chmod a+x Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN bash Miniconda3-py38_4.9.2-Linux-aarch64.sh -b
ENV PATH /root/miniconda3/bin/:$PATH

COPY ml-dependencies.yml ./
RUN conda env create -f ml-dependencies.yml

ENV PATH /root/miniconda3/envs/ml-dependencies/bin:$PATH

This is the ml-dependencies.yml:

name: ml-dependencies
dependencies:
  - python=3.8
  - numpy
  - pandas
  - scikit-learn
  - tensorflow==2.8.2
  - pip
  - pip:
    - sagemaker-training

After building the image locally (for example, docker build -t sagemaker-tensorflow2-graviton-training-toolkit-local .), this is how you run the training using SageMaker Script Mode:

    from sagemaker.estimator import Estimator

    image = 'sagemaker-tensorflow2-graviton-training-toolkit-local'

    california_housing_estimator = Estimator(
        image_uri=image,
        entry_point='california_housing_tf2.py',
        source_dir='code',
        role=DUMMY_IAM_ROLE,  # Local Mode does not assume the role; a well-formed dummy ARN is enough
        instance_count=1,
        instance_type='local',
        hyperparameters={'epochs': 10,
                         'batch_size': 64,
                         'learning_rate': 0.1})

    inputs = {'train': 'file://./data/train', 'test': 'file://./data/test'}
    california_housing_estimator.fit(inputs, logs=True)
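
As a side note on how those hyperparameters reach the entry point: the sagemaker-training toolkit passes them to the script as command-line flags, so california_housing_tf2.py would typically parse them like this (a sketch; the flag names mirror the hyperparameters dict above, and the defaults are illustrative):

```python
# Hyperparameters from the Estimator arrive as CLI flags, e.g. the container runs
#   python california_housing_tf2.py --epochs 10 --batch_size 64 --learning_rate 0.1
import argparse
import os


def parse_hyperparameters(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--learning_rate', type=float, default=0.1)
    # SageMaker tells the script where to write the model via SM_MODEL_DIR.
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    return parser.parse_args(argv)
```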

You can find the full working sample code on the Amazon SageMaker Local Mode Examples GitHub repository here.

Eitan Sela
    Thank you, I had already stumbled upon this path, but it's really nice to have it validated. Appreciate it. Also, there are many unknown, unanswered problems on this subject; I'd like to keep the thread open. :) – spramuditha Oct 21 '22 at 00:26