I want to run an RL training job on AWS SageMaker (script given below). Since the project is complex, I was hoping to do a test run with SageMaker Local Mode on my M1 MacBook Pro before submitting it to paid instances. However, I am struggling to get even a simple training task to run locally.
I have used tensorflow-metal and tensorflow-macos when running local training jobs without SageMaker, but I do not see anywhere to specify these in framework_version, nor am I sure that "local_gpu", which is the correct instance_type for a normal Linux machine with a GPU, applies to Apple Silicon (M1 Pro).
I searched all over but cannot find a case where this is addressed. (Very odd; am I doing something wrong? If so, please correct me.) If not, and anyone knows of a configuration, a Docker image, or an example that works on an M1 Pro, please share.
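From what I understand (this is only my assumption, not something I have seen documented), 'local_gpu' relies on the NVIDIA container runtime, which does not exist on Apple Silicon, so the closest I can probably get on the M1 Pro is plain CPU local mode. This is the fallback I have in mind; it is the same estimator as in the full script below, only with instance_type changed:
##local_cpu_fallback.py (sketch of the CPU-only variant)
from sagemaker.tensorflow import TensorFlow

role = 'arn:aws:iam::0000000000000:role/YourSageMakerExecutionRole'  # placeholder, replace with your role ARN

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          instance_count=1,
                          instance_type='local',  # CPU-only local mode; no NVIDIA runtime required
                          framework_version='2.1.0',
                          py_version='py3',
                          hyperparameters={'epochs': 1})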
I tried to run the following code, which hangs after logging in. (If you are trying to run it, use any simple training script as the entry_point, and make sure to log in to ECR first with the AWS CLI, using a command like the one below adjusted to your region.)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
##main.py
import os
import numpy as np
import sagemaker
from keras.datasets import fashion_mnist
from sagemaker.tensorflow import TensorFlow
sess = sagemaker.Session()
role = 'arn:aws:iam::0000000000000:role/CFN-SM-IM-Lambda-Catalog-sk-SageMakerExecutionRole-BlaBlaBla'  # KINDLY ADD YOUR OWN ROLE ARN HERE
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()
os.makedirs("./data", exist_ok=True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)
# Train on local data. S3 URIs would work too.
training_input_path = 'file://data/training.npz'
validation_input_path = 'file://data/validation.npz'
# Store the model locally. An S3 URI would work too.
output_path = 'file:///tmp/model/'
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          instance_count=1,
                          instance_type='local_gpu',  # 'local' trains on the local CPU; 'local_gpu' expects a (NVIDIA) GPU
                          framework_version='2.1.0',
                          py_version='py3',
                          hyperparameters={'epochs': 1},
                          output_path=output_path)
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
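For completeness, this is roughly what I mean by a simple training script for the entry_point (a minimal sketch only; it assumes the standard SageMaker environment variables SM_CHANNEL_TRAINING, SM_CHANNEL_VALIDATION and SM_MODEL_DIR, and the 'image'/'label' keys saved by main.py above):
##mnist_keras_tf.py (minimal sketch)
import argparse
import os
import numpy as np
import tensorflow as tf

def load_npz(directory, filename):
    # Load the arrays written by main.py and scale pixels to [0, 1].
    data = np.load(os.path.join(directory, filename))
    return data['image'] / 255.0, data['label']

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)        # comes from hyperparameters
    parser.add_argument('--model_dir', type=str, default=None)  # injected by SageMaker; unused here
    parser.add_argument('--training', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    args = parser.parse_args()

    x_train, y_train = load_npz(args.training, 'training.npz')
    x_val, y_val = load_npz(args.validation, 'validation.npz')

    # Tiny dense classifier, just enough to verify the container runs end to end.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train,
              epochs=args.epochs,
              validation_data=(x_val, y_val))

    # Save as a SavedModel under a numeric version folder so TF Serving could load it.
    model.save(os.path.join(args.sm_model_dir, '1'))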