
According to the API reference, one way to optimize data ingestion for distributed training is to use ShardedByS3Key.

Are there code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes to, e.g., PyTorch's DistributedSampler (should it be used at all?) or TF's tf.data pipeline are necessary? I mean the distribution setting of the input channel, roughly as sketched below.
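For context, a minimal sketch of the configuration I have in mind (bucket, prefix, and channel name are just placeholders):

from sagemaker.inputs import TrainingInput

# With ShardedByS3Key each training instance downloads only a subset of the
# S3 objects under the prefix, instead of the full dataset.
train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",  # placeholder prefix
    distribution="ShardedByS3Key",
)
# later passed as estimator.fit({"train": train_input})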

1 Answer


According to the documentation on the "Sharded Data Parallelism" technique:

The standard data parallelism technique replicates the training states across the GPUs in the data parallel group, and performs gradient aggregation based on the AllReduce operation.

So you can simply leave the default FullyReplicated mode in your TrainingInput's distribution parameter, because the parallelism does not happen at the level of data division across the instances upstream, but later, on the GPUs.
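As for the DistributedSampler part of the question: yes, you still use it in the training script, because the per-GPU split happens there and not in the input channel. A minimal sketch, assuming the SageMaker data parallel library is enabled on the estimator (recent versions register themselves as a torch.distributed backend named "smddp"); the dataset below is a placeholder:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")

# Placeholder dataset; in a real job this would read the (fully replicated) channel data.
features = torch.randn(1024, 16)
labels = torch.randint(0, 2, (1024,))
train_dataset = TensorDataset(features, labels)

# DistributedSampler performs the per-GPU split: every rank iterates over a
# non-overlapping subset of the dataset, even though the files are replicated.
sampler = DistributedSampler(
    train_dataset,
    num_replicas=dist.get_world_size(),
    rank=dist.get_rank(),
    shuffle=True,
)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for batch_features, batch_labels in train_loader:
        pass  # forward/backward/optimizer step would go here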

See the guide "How to Apply Sharded Data Parallelism to Your Training Job" or the full example notebook "Train GPT-2 with near-linear scaling using Sharded Data Parallelism technique in SageMaker Model Parallelism Library". The latter sets the relevant parameters explicitly, step by step; a rough sketch follows.
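Roughly, those examples configure the model parallelism library like this (parameter names follow the sharded data parallelism guide; the entry point, versions, and instance type below are placeholders, not the notebook's exact values):

import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        # number of ranks to shard the training state across
        "sharded_data_parallel_degree": 2,
    },
}

estimator = PyTorch(
    entry_point="train.py",  # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="1.13.1",
    py_version="py39",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)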


For example, to enable SageMaker distributed data parallelism, you have to set at least the distribution dict on the PyTorch (or TensorFlow) estimator:

{ "smdistributed": { "dataparallel": { "enabled": True } } }
Giuseppe La Gualano