I have created a training job on AWS SageMaker and it runs well: it reads data from an S3 location and stores model checkpoints in S3 as intended. Now I need to trigger this training job with specified parameters (e.g. the S3 location of the data) from a website (via API Gateway). My first idea was to write a Lambda function that is invoked by an API call and starts the training job using the SageMaker API:
from sagemaker.huggingface import HuggingFace

# role and hyperparameters are defined earlier (not shown)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
)

# starting the training job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
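For context, the Lambda handler I have in mind would look roughly like the sketch below. It only shows how the parameters would arrive from API Gateway (assuming the proxy integration, where the request body is a JSON string in `event["body"]`). The field names `train_s3_uri`, `test_s3_uri` and `hyperparameters` are placeholders I made up, and `start_training` is a hypothetical callable that would wrap the estimator code above.

```python
import json


def parse_training_request(event):
    """Extract training parameters from an API Gateway proxy event.

    The field names used here are illustrative, not a fixed contract.
    """
    body = json.loads(event.get("body") or "{}")
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    missing = [k for k in ("train_s3_uri", "test_s3_uri") if k not in body]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {
        "inputs": {"train": body["train_s3_uri"], "test": body["test_s3_uri"]},
        "hyperparameters": body.get("hyperparameters", {}),
    }


def make_handler(start_training):
    """Build a Lambda handler around a hypothetical start_training callable."""

    def handler(event, context):
        try:
            request = parse_training_request(event)
        except ValueError as exc:  # also covers malformed JSON
            return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
        # start_training would create the estimator and call fit() on it
        start_training(request["inputs"], request["hyperparameters"])
        return {"statusCode": 202, "body": json.dumps({"status": "started"})}

    return handler
```

Passing `start_training` in as an argument is just so the request parsing can be tested locally without touching SageMaker.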
However, AWS Lambda has a maximum runtime of 15 minutes, which is less than the training time required. Is there a serverless way of doing the same thing? Is AWS Step Functions any different from Lambda in this regard?