
I am trying to train a PyTorch model through SageMaker. I run a script, main.py (a minimal working example is posted below), which creates a PyTorch Estimator. The training code lives in a separate script, train.py, which is passed to the Estimator through its entry_point parameter. Both scripts are hosted on an EC2 instance in the same AWS region as my SageMaker domain.

When I run this with instance_type = "ml.m5.4xlarge", it works. However, I cannot debug problems in train.py: any bug in that file produces only the error 'AlgorithmError: ExecuteUserScriptError', and I cannot use breakpoint() in train.py (reaching a breakpoint throws the same error).

Instead, I am trying to run in local mode, which I believe does allow breakpoints. However, when execution reaches estimator.fit(inputs), it hangs on that line indefinitely and produces no output. Print statements placed at the start of the main function in train.py are never reached, no matter what code train.py contains. Local mode also did not throw an error when I had an illegal underscore in the base_job_name parameter of the estimator, which suggests that it does not even create the estimator instance.
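Since local mode runs the training container through Docker on the machine that calls fit(), two things that may be worth ruling out are whether the Docker daemon is reachable at all and where the Python process is actually blocked. The following is only a minimal diagnostic sketch built on the standard library. `docker_available`, `arm_hang_dump` and `disarm_hang_dump` are hypothetical helper names, not part of the SageMaker SDK, and the sketch assumes the `docker` CLI is on the PATH.

```python
import faulthandler
import subprocess
import sys


def docker_available(cmd=("docker", "info"), timeout=15):
    """Return True if the command exits with status 0 within `timeout` seconds.

    With the default command this is a rough check that the Docker CLI exists
    and can talk to the daemon (assumed here to be a local-mode prerequisite).
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # CLI missing or not executable, or no answer within the timeout.
        return False
    return result.returncode == 0


def arm_hang_dump(seconds=120, file=sys.stderr):
    """Dump every thread's stack to `file` each time `seconds` elapse."""
    faulthandler.dump_traceback_later(seconds, repeat=True, file=file)


def disarm_hang_dump():
    """Cancel the pending dump once the blocking call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

In main.py, one might call `docker_available()` before building the estimator, then `arm_hang_dump()` immediately before `hosted_estimator.fit(inputs)` and `disarm_hang_dump()` after it. If fit() is still blocked after the interval, the stack dump on stderr shows which call it is waiting in.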

Below is a minimal example that reproduces the issue on my instance. Any help would be appreciated.

### File structure

main.py

customcode/
    |
    |_ train.py

### main.py

import sagemaker
from sagemaker.pytorch import PyTorch
import boto3

try:
    # When running on Studio.

    sess = sagemaker.session.Session()
    bucket = sess.default_bucket() 
    role = sagemaker.get_execution_role()

except ValueError:
    # When running from EC2 or local machine.

    print('Performing manual setup.')
    bucket = 'MY-BUCKET-NAME'
    region = 'us-east-2'
    role = 'arn:aws:iam::MY-ACCOUNT-NUMBER:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXX'

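    # Configure the default boto3 session first so that the clients
    # created below pick up the region and profile set here.
    boto3.setup_default_session(region_name=region, profile_name="default")

    iam = boto3.client("iam")
    sagemaker_client = boto3.client("sagemaker")

    sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)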

hyperparameters = {'epochs': 10}
inputs = {'data': f's3://{bucket}/features'}

train_instance_type = 'local'

hosted_estimator = PyTorch(
    source_dir='customcode',
    entry_point='train.py',
    instance_type=train_instance_type,
    instance_count=1,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name='mwe-train',
    framework_version='1.12',
    py_version='py38',
    input_mode='FastFile',
)

hosted_estimator.fit(inputs) # This is the line that freezes

### train.py

def main():
    breakpoint() # Throws an error in non-local mode.
    return 

if __name__ == '__main__':
    print('Reached') # Never reached in local mode.
    main()
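
For the non-local runs, where a failure surfaces only as 'AlgorithmError: ExecuteUserScriptError', one option that might make the underlying exception easier to find in the job logs is to print the full traceback from inside train.py before exiting with a non-zero status. This is a minimal sketch, not a confirmed fix. `run_and_report` is a hypothetical helper name, and the `main` shown here is a deliberately failing stand-in for the real training code.

```python
import sys
import traceback


def run_and_report(fn, *args, **kwargs):
    """Call fn; on any exception print the full traceback to stdout and return 1."""
    try:
        fn(*args, **kwargs)
    except Exception:
        # Write to stdout so the traceback sits next to ordinary print() output.
        traceback.print_exc(file=sys.stdout)
        sys.stdout.flush()
        return 1
    return 0


def main():
    # Stand-in for the real training code; raises to exercise the reporting path.
    raise ValueError("example failure inside train.py")
```

In train.py the final line under the usual `if __name__ == '__main__':` guard would then be `sys.exit(run_and_report(main))`, so the job still fails but the traceback is printed first.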
Scott Vinay
  • Have you checked the logs on CloudWatch? The problem is very likely made explicit there. – Giuseppe La Gualano Nov 03 '22 at 06:45
  • @GiuseppeLaGualano Unfortunately there are no CloudWatch logs for these jobs. I expect this is because it is in local mode. The attempts I made in non-local mode do have logs recorded. – Scott Vinay Nov 03 '22 at 06:58
