4

What is the difference between AWS Batch and a SageMaker Training Job when using them to run a Docker image for machine learning training?

ryfeus
  • 323
  • 2
  • 6

1 Answer

6

Both services are implementations of CaaS (Container-as-a-Service). This means you don't have to manage clusters yourself and only need to define a launch configuration. In that respect, either service can run training jobs once you have your Docker image ready. Notable differences are:

  1. [Operational complexity] AWS Batch has higher operational complexity than SageMaker training jobs. With SageMaker you don't need to provision any infrastructure, at most an IAM role, which can be generated automatically. With AWS Batch you need to deploy infrastructure first (a compute environment, a job queue, and a job definition), although in return you get finer control over it.
  2. [Architecture] AWS Batch is less of a pure CaaS and closer to a managed cluster. It has a job queue, scales the cluster based on the queue size, and places jobs on the machines. SageMaker training jobs start VMs per job, and the VMs themselves are abstracted away from the user. So, for example, you can SSH into an AWS Batch instance, but not into a SageMaker one.
  3. [Docker image] SageMaker requires heavier customization of the Docker container to make it work, but in return it handles things you would otherwise implement yourself, such as passing hyperparameters, gathering metrics, and saving the model. AWS Batch just runs the container, so any such logic has to be implemented by the developer.
  4. [Cost] Both AWS Batch and SageMaker training jobs are free in the sense that you pay only for the underlying infrastructure you use. SageMaker training jobs use ml.* instances, which are ~10-20% more expensive than their on-demand EC2 counterparts, and sometimes more (e.g. p2.xlarge costs $0.9 per hour and ml.p2.xlarge costs $1.125 per hour, which works out to a 25% premium). Both services can also run on spot instances at lower cost.
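
To illustrate the [Docker image] point, here is a minimal sketch of a training entry point that could run under either service. It relies on the SageMaker container layout as I understand it: hyperparameters arrive as string values in `/opt/ml/input/config/hyperparameters.json`, and whatever is written to `/opt/ml/model` is uploaded to S3 when the job ends. AWS Batch has no such convention, so the `HP_`-prefixed environment variables, the function names, and the toy "training" loop are assumptions made up for this example.

```python
import json
import os
from pathlib import Path

# Directory SageMaker mounts inside a training container.
SAGEMAKER_PREFIX = Path("/opt/ml")


def load_hyperparameters(prefix=SAGEMAKER_PREFIX, environ=os.environ):
    """Read hyperparameters from SageMaker's JSON file, else from env vars."""
    config = prefix / "input" / "config" / "hyperparameters.json"
    if config.exists():
        # SageMaker serializes every hyperparameter value as a string.
        return json.loads(config.read_text())
    # AWS Batch just runs the container: the HP_ prefix is our own
    # (hypothetical) convention, not something Batch defines.
    return {
        key[len("HP_"):].lower(): value
        for key, value in environ.items()
        if key.startswith("HP_")
    }


def save_model(weights, prefix=SAGEMAKER_PREFIX):
    """Write the model where SageMaker collects artifacts from."""
    model_dir = prefix / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "model.json"
    path.write_text(json.dumps(weights))
    # Under AWS Batch nothing uploads this file for you; the developer
    # would have to copy it to S3 (or elsewhere) explicitly.
    return path


def train(prefix=SAGEMAKER_PREFIX, environ=os.environ):
    """Toy training loop standing in for real model code."""
    hp = load_hyperparameters(prefix, environ)
    epochs = int(hp.get("epochs", "1"))
    loss = 1.0
    for epoch in range(epochs):
        loss /= 2  # placeholder for an actual optimization step
        # SageMaker can extract metrics from log lines with a regex
        # supplied in the job's metric definitions; Batch only logs them.
        print(f"epoch={epoch} loss={loss:.4f}")
    return save_model({"epochs": epochs, "final_loss": loss}, prefix)
```

The container's entry point would simply call `train()`. The `prefix` and `environ` parameters exist only so the functions can be exercised outside a container.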

So to summarize - AWS Batch is a more generalized and customizable tool, while SageMaker Training Jobs is a more focused one with more prebuilt features.
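
To make the contrast concrete, here is a rough sketch of the request each service expects when launching the same image through boto3. The field names follow the `submit_job` and `create_training_job` APIs as I know them; the helper function names, the `HP_` environment variable convention, and the default instance type, volume size, and timeout are assumptions chosen for illustration. The functions only build the request dictionaries, so nothing here touches AWS.

```python
def batch_submit_job_request(job_name, job_queue, job_definition, hyperparameters):
    """Build kwargs for boto3.client("batch").submit_job(**request).

    The queue and job definition (which names the image) must already
    exist: that is the infrastructure you deploy up front with Batch.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Batch has no notion of hyperparameters, so we pass them as
            # environment variables using our own HP_ convention.
            "environment": [
                {"name": f"HP_{name.upper()}", "value": str(value)}
                for name, value in hyperparameters.items()
            ]
        },
    }


def sagemaker_training_job_request(
    job_name,
    image_uri,
    role_arn,
    output_s3_uri,
    hyperparameters,
    instance_type="ml.p2.xlarge",
):
    """Build kwargs for boto3.client("sagemaker").create_training_job(**request).

    No cluster is referenced: the instance is described inline and is
    created for this job only.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # First-class hyperparameters; values must be strings.
        "HyperParameters": {
            name: str(value) for name, value in hyperparameters.items()
        },
    }
```

The Batch request points at resources you manage (queue, job definition), while the SageMaker request carries the whole launch configuration plus training-specific fields such as `HyperParameters` and the output location.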

ryfeus
  • 1
    Nice breakdown. Note that this solution allows you to SSH into SageMaker jobs: https://github.com/aws-samples/sagemaker-ssh-helper – Gili Nachum Nov 16 '22 at 11:52