
I am currently using SageMaker to train BERT and am trying to improve the training time. I use PyTorch and Hugging Face on the AWS g4dn.12xlarge instance type.

However, when I run parallel training, the speedup is far from linear. I'm looking for hints on distributed training to improve BERT training time in SageMaker.


1 Answer


You can use SageMaker Distributed Data Parallel (SMDDP) to run training on a multi-node, multi-GPU setup. Please refer to the links below for BERT-based training examples (a minimal launch sketch follows them).

https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb

This one uses Hugging Face: https://github.com/aruncs2005/pytorch-ddp-sm-example
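For orientation, here is a minimal sketch (not taken from the linked notebooks) of how such a job might be launched with the SageMaker Python SDK's HuggingFace estimator. The script name, source directory, framework versions, hyperparameters, and S3 path are placeholders to adapt to your setup, and note that SMDDP is supported on larger multi-GPU instances such as ml.p4d.24xlarge rather than g4dn:

```python
# Minimal sketch of launching an SMDDP training job with the SageMaker Python SDK.
# Script name, role, versions, instance type, and the S3 path are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",              # your Hugging Face training script
    source_dir="./scripts",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",     # SMDDP needs p3/p4-class instances
    instance_count=2,                    # multi-node
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    # This single setting turns on SageMaker Distributed Data Parallel
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={"epochs": 3, "per_device_train_batch_size": 32},
)

estimator.fit({"train": "s3://your-bucket/bert/train"})
```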

Please refer to the documentation here for step-by-step instructions on adapting your training script:

https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
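As a rough sketch of the script-side changes that page describes, assuming a recent SMDDP release (1.4+) that registers an "smddp" backend for torch.distributed; the model and environment-variable handling below are illustrative:

```python
# Sketch of adapting a PyTorch/Hugging Face training script for SMDDP.
import os

import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401 - registers the "smddp" backend
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import BertForSequenceClassification

# Use the SMDDP backend instead of "nccl"
dist.init_process_group(backend="smddp")

# LOCAL_RANK is set by the launcher for each GPU worker
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model = model.to(local_rank)

# Standard PyTorch DDP wrapper; SMDDP supplies the optimized AllReduce
model = DDP(model, device_ids=[local_rank])

# The rest of the training loop (DistributedSampler, optimizer, etc.) stays as in plain DDP
```

The rest of your training loop can stay unchanged; the gain comes from SMDDP's communication backend during gradient AllReduce across nodes.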
