
I have built a training job on AWS SageMaker and it runs well: it reads from an S3 location and stores model checkpoints in S3 as intended. Now I need to trigger this training job with specified parameters (e.g. the S3 location of the data) from a website (via API Gateway). My first idea was to create a Lambda function that is invoked by the API call and starts the training job using the SageMaker API:

from sagemaker.huggingface import HuggingFace

# role, hyperparameters and the two input paths are defined elsewhere
huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    instance_type='ml.p3.2xlarge',
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',
                                    hyperparameters=hyperparameters)

# start the training job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

However, AWS Lambda has a maximum runtime of 15 minutes, which is less than the training time required. Is there a serverless way of doing the same thing? Is AWS Step Functions any different from Lambda in this regard?

  • I'm not sure, but doesn't the Lambda function only need to trigger SageMaker? What does the Lambda do next? (I mean, SageMaker itself can upload to S3.) – shimo Jan 30 '22 at 03:39
  • The flow is: Lambda starts a SageMaker training job -> on completion, Lambda deploys the trained model and sends the deployed model link back to the API as a response. So Lambda, or any other backend (typically an EC2 instance), is supposed to be the central point of contact for external calls. – Sachin Saxena Jan 31 '22 at 04:11
  • I think you can create two Lambdas. One starts the SageMaker job. The other deploys the model and sends back the link. The second Lambda should be called at the end of the SageMaker job, either with boto3 or from Step Functions. – shimo Jan 31 '22 at 20:37
  • This sounds like a nice hack. Thanks – Sachin Saxena Feb 01 '22 at 10:04

1 Answer

You can launch the training job asynchronously, either by passing wait=False to fit() or by calling boto3's create_training_job. That way the Lambda only starts the job and returns immediately, without having to wait for training to complete.
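
As a rough illustration of the boto3 route, here is a minimal sketch of a Lambda handler that starts the job and returns right away. Several things in it are assumptions rather than part of the original setup: the event field names (`train_s3_uri`, `output_s3_uri`, `role_arn`, `image_uri`), the `hf-train-` job name prefix, the volume size and the runtime limit are all hypothetical, and the event is assumed to arrive as a plain dict. The optional `sagemaker_client` argument exists only so the handler can be exercised without AWS access. A real HuggingFace script-mode job would likely also need the hyperparameters that the SageMaker SDK normally sets for you (entry point and source location), which this sketch leaves out.

```python
import json
import time


def build_training_request(job_name, params):
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API.

    `params` is a dict with hypothetical keys: role_arn, image_uri,
    train_s3_uri and output_s3_uri.
    """
    return {
        "TrainingJobName": job_name,
        "RoleArn": params["role_arn"],
        "AlgorithmSpecification": {
            "TrainingImage": params["image_uri"],
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": params["train_s3_uri"],
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": params["output_s3_uri"]},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # assumed value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},  # assumed limit
    }


def lambda_handler(event, context, sagemaker_client=None):
    """Start a training job and return immediately, without waiting for it."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        sagemaker_client = boto3.client("sagemaker")

    # Job names must be unique per account and region; a timestamp is one option.
    job_name = "hf-train-{}".format(int(time.time()))
    request = build_training_request(job_name, event)

    # create_training_job returns as soon as the job is accepted.
    response = sagemaker_client.create_training_job(**request)

    return {
        "statusCode": 202,
        "body": json.dumps(
            {"job_name": job_name, "job_arn": response["TrainingJobArn"]}
        ),
    }
```

The handler finishes in well under the 15 minute limit because it never blocks on training. The wait=False route is a smaller change to the existing estimator code, but it presumably requires bundling the sagemaker SDK with the Lambda, whereas boto3 is already present in the Lambda Python runtime.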

Olivier Cruchant