I am trying to train a PyTorch model through SageMaker. I am running a script main.py
(which I have posted a minimum working example of below) which calls a PyTorch Estimator. I have the code for training my model saved as a separate script, train.py
, which is called by the entry_point
parameter of the Estimator. These scripts are hosted on a EC2 instance in the same AWS region as my SageMaker domain.
When I try running this with instance_type = "ml.m5.4xlarge"
, it works ok. However, I am unable to debug any problems in train.py
. Any bugs in that file simply give me the error: 'AlgorithmError: ExecuteUserScriptError'
, and will not allow me to set breakpoint()
lines in train.py
(encountering a breakpoint throws the above error).
Instead I am trying to run in local mode, which I believe does allow for breakpoints. However, when I reach estimator.fit(inputs)
, it hangs on that line indefinitely, giving no output. Any print statements that I put at the start of the main
function in train.py
are not reached. This is true no matter what code I put in train.py
. It also did not throw an error when I had an illegal underscore in the base_job_name
parameter of the estimator, which suggests that it does not even create the estimator instance.
Below is a minimum example which replicates the issue on my instance. Any help would be appreciated.
### File structure
main.py
customcode/
|
|_ train.py
### main.py
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3
try:
# When running on Studio.
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
except ValueError:
# When running from EC2 or local machine.
print('Performing manual setup.')
bucket = 'MY-BUCKET-NAME'
region = 'us-east-2'
role = 'arn:aws:iam::MY-ACCOUNT-NUMBER:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXX'
iam = boto3.client("iam")
sagemaker_client = boto3.client("sagemaker")
boto3.setup_default_session(region_name=region, profile_name="default")
sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)
hyperparameters = {'epochs': 10}
inputs = {'data': f's3://{bucket}/features'}
train_instance_type = 'local'
hosted_estimator = PyTorch(
source_dir='customcode',
entry_point='train.py',
instance_type=train_instance_type,
instance_count=1,
hyperparameters=hyperparameters,
role=role,
base_job_name='mwe-train',
framework_version='1.12',
py_version='py38',
input_mode='FastFile',
)
hosted_estimator.fit(inputs) # This is the line that freezes
### train.py
def main():
breakpoint() # Throws an error in non-local mode.
return
if __name__ == '__main__':
print('Reached') # Never reached in local mode.
main()