Questions tagged [distributed-training]

83 questions
1
vote
0 answers

[PyTorch] Error when using DistributedDataParallel in the broadcasting stage of initialization

I'm currently working on GroupFormer, which uses DistributedDataParallel for training. The error message is listed below; it shows that the error is caused by a tensor size mismatch while broadcasting in the initialization stage. This error first…
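This kind of broadcast size mismatch usually means the ranks did not build identical parameter shapes before wrapping the model, since DDP broadcasts rank 0's parameters to every other rank at construction time. A minimal sketch of a consistent single-node setup, assuming a torchrun-style launcher that sets LOCAL_RANK (the model factory here is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model_fn):
    # Every rank must construct the model with identical architecture and
    # parameter shapes *before* wrapping; DDP broadcasts rank 0's weights
    # to the other ranks during __init__.
    dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from the env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model_fn().cuda(local_rank)          # same config => same shapes on every rank
    return DDP(model, device_ids=[local_rank])
```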
1
vote
1 answer

Distributed training on PyTorch and Spot checkpoints in SageMaker

I'm building a custom model on PyTorch and want to know how to implement snapshot logic for distributed training. If a model is trained on multiple spot instances and implemented on a BYO PyTorch image, how does SageMaker know which…
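For reference, the usual pattern is a sketch like the one below (image URI, role, and bucket are placeholders): point the Estimator at an S3 checkpoint location and have the training script save to and resume from the local checkpoint path; SageMaker syncs that directory to S3 and restores it when a spot instance is replaced.

```python
# Sketch of a SageMaker Estimator configured for spot training with
# checkpoint sync to S3 (paths, role, and image URI are placeholders).
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-byo-pytorch-image>",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,                                # must be >= max_run for spot training
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",  # the training script saves/loads here
)
```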
1
vote
1 answer

Add Security groups in Amazon SageMaker for distributed training jobs

We would like to enforce specific security groups to be set on the SageMaker training jobs (XGBoost in script mode). However, distributed training, in this case, won’t work out of the box, since the containers need to communicate with each other.…
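A rough sketch of attaching VPC networking to a script-mode XGBoost estimator, with placeholder subnet and security-group IDs; for distributed training the security group generally needs a self-referencing inbound rule so the training containers can reach each other:

```python
# Sketch: run a script-mode XGBoost training job inside a VPC
# (subnet and security-group identifiers are placeholders).
from sagemaker.xgboost import XGBoost

estimator = XGBoost(
    entry_point="train.py",
    framework_version="1.5-1",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],  # add a self-referencing rule so the
                                                  # containers can talk to each other
)
```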
1
vote
0 answers

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

I am trying to utilize multiple GPUs using Horovod for distributed training. Initially, I used a single GPU and then two GPUs to test a simple convolutional neural network. Everything functions properly. Then, I used a CNN and LSTM in combination. It…
Ahmad
  • 645
  • 2
  • 6
  • 21
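For context, the typical Horovod pinning pattern for TF2 maps each process to the GPU matching its local rank; if CUDA_VISIBLE_DEVICES has already hidden devices, local_rank() can point at a GPU id that no longer exists, which matches the error above. A sketch:

```python
# Typical Horovod GPU pinning for TF2: each process sees all GPUs and selects
# the one matching its local rank.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
```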
1
vote
1 answer

Distributed Training Terminology: Micro-batch and Per-Replica batch size

I am reading through the SageMaker documentation on distributed training and am confused by the terminology: mini-batch, micro-batch, and per-replica batch size. I understand that in data parallelism there would be multiple copies of the model, and each…
outlier229
  • 481
  • 1
  • 7
  • 18
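A rough arithmetic sketch of how the three terms usually relate (the numbers are purely illustrative): the mini-batch is the global batch consumed per optimizer step, the per-replica batch is that mini-batch divided across model copies, and the micro-batch is a further split of the per-replica batch used for pipelining or gradient accumulation.

```python
# Illustrative numbers only: how mini-batch, per-replica batch, and
# micro-batch typically relate in data-parallel training.
global_mini_batch = 256    # samples consumed per optimizer step, across all replicas
num_replicas      = 8      # model copies (data parallelism)
per_replica_batch = global_mini_batch // num_replicas       # 32 samples per replica per step

num_microbatches  = 4      # split of each replica's batch (pipelining / grad accumulation)
micro_batch       = per_replica_batch // num_microbatches   # 8 samples per micro-batch
```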
1
vote
0 answers

keras.models.load_model does not work as expected within MirroredStrategy

I am trying to load a model within MirroredStrategy. I find that the loaded model within MirroredStrategy is not working correctly, in that only one replica is found, while there are actually 4 visible devices specified. This does not happen for the model…
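A minimal sketch of the usual pattern, assuming a placeholder SavedModel path: variables have to be created inside strategy.scope() to become mirrored variables, so the load (or at least the model construction) normally happens within the scope.

```python
# Sketch: load a saved Keras model inside the strategy scope so its variables
# are created as mirrored variables across all visible GPUs.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # should report 4 replicas for 4 visible GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.models.load_model("<path/to/saved_model>")  # placeholder path
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```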
1
vote
1 answer

SageMaker Notebook instance error: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'

I have a Dask cluster active (from dask.distributed import Client, progress; client = Client(); client). When I try to encode my data I get the error AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'. I encoded the data…
1
vote
1 answer

ROLLBACK_IN_PROGRESS status after creating dask-fargate-stack on AWS CloudFormation

I am following this guide to be able to use Dask distributed on my SageMaker instance, so I can train my big-data regression model, but when I create the stack I get the status ROLLBACK_IN_PROGRESS. How can I manually create the stack for dask…
1
vote
0 answers

Distributed training with TensorFlow on 'x' GPUs makes loss 1/x

I was trying to run a model on multiple GPUs with the mirrored strategy of TensorFlow. I used a custom loss function like this: def mae(y_true, y_pred): # y_true, y_pred shape = (B, L) loss = tf.keras.metrics.mean_absolute_error(y_true, y_pred) …
steinum
  • 76
  • 3
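A sketch of the common fix for a custom training loop under MirroredStrategy (GLOBAL_BATCH_SIZE is illustrative): per-example losses should be averaged over the global batch size rather than the per-replica batch, otherwise the effective loss shrinks roughly by 1/num_replicas as replicas are added.

```python
# Sketch: sum per-example losses and divide by the GLOBAL batch size when
# training under MirroredStrategy with a custom loop.
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # illustrative

def mae(y_true, y_pred):
    per_example = tf.keras.metrics.mean_absolute_error(y_true, y_pred)  # shape (per_replica_batch,)
    return tf.nn.compute_average_loss(per_example, global_batch_size=GLOBAL_BATCH_SIZE)
```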
1
vote
0 answers

How does a TensorFlow (2.0) distributed dataset manage data?

I'm a newbie to TensorFlow. I have been learning how to use TensorFlow to train models in a distributed manner, and I have access to multiple servers, each with multiple CPUs. The training mechanisms are clearly outlined in the documentation and tutorials,…
smjfas
  • 43
  • 1
  • 1
  • 6
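A minimal sketch, with toy in-memory data standing in for a real input pipeline, of how a tf.data.Dataset is handed to a strategy: the strategy splits each global batch into per-replica batches, and multi-worker strategies additionally auto-shard the dataset across workers by default.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 16 * strategy.num_replicas_in_sync

# Toy in-memory data standing in for the real input pipeline.
features = tf.random.normal([256, 8])
labels = tf.random.uniform([256], maxval=2, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(256)
           .batch(GLOBAL_BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))

# The strategy splits each global batch into per-replica batches; for
# multi-worker strategies the dataset is also auto-sharded across workers.
dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    pass  # each `batch` holds the per-replica slices, normally consumed via strategy.run(...)
```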
1
vote
1 answer

DistributedDataParallel with gpu device ID specified in PyTorch

I want to train my model with DistributedDataParallel on a single machine that has 8 GPUs, but only on four specific GPUs, with device IDs 4, 5, 6, 7. How do I specify the GPU device IDs for DistributedDataParallel? I think the…
Bipin
  • 53
  • 1
  • 8
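A common approach (a sketch, with a toy linear layer standing in for the real model) is to expose only the wanted devices via CUDA_VISIBLE_DEVICES so they are renumbered 0-3 inside the processes, then bind each rank to its local device:

```python
# Launch so that only GPUs 4-7 are visible (they appear as cuda:0..cuda:3 in-process):
#   CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])       # 0..3, mapping onto physical GPUs 4..7
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # toy model for illustration
ddp_model = DDP(model, device_ids=[local_rank])
```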
1
vote
1 answer

How to know how many GPUs are used in PyTorch?

The bash file I used to launch the training looks like this: CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch \ --nproc_per_node=2 train.py \ --batch_size 6 \ --other_args I found that the batch size of tensors in each GPU is actually…
zheyuanWang
  • 1,158
  • 2
  • 16
  • 30
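A quick diagnostic sketch that could be dropped into train.py, assuming a launcher that sets the LOCAL_RANK environment variable (recent torch.distributed.launch/torchrun versions do): it prints how many GPUs each process sees, which one it uses, and the world size.

```python
# Quick check inside train.py: how many GPUs this process can see, which one
# it is actually using, and how many processes participate overall.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()} / world_size {dist.get_world_size()}: "
      f"visible GPUs = {torch.cuda.device_count()}, "
      f"using = {torch.cuda.current_device()} ({torch.cuda.get_device_name(local_rank)})")
```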
1
vote
0 answers

Training spaCy NER models on multiple GPUs (not just one)

I am training my NER model using the following code. Start of Code: def train_spacy(nlp, training_data, iterations): if "ner" not in nlp.pipe_names: ner = nlp.create_pipe('ner') nlp.add_pipe("ner", last = True) …
1
vote
0 answers

How to speed up TF model training? MultiWorkerMirroredStrategy looks a lot slower than non-distributed

I'm using the code in the Keras distributed training example with TF 2.4.1, following other docs: https://www.tensorflow.org/guide/distributed_training https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy On a single…
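For reference, a minimal sketch of the multi-worker setup (placeholder hostnames; the task index differs per worker): TF_CONFIG has to be exported on every worker before the strategy is created, and every step then pays a cross-host gradient all-reduce, which is why small models often train slower than the single-machine baseline.

```python
import json
import os
import tensorflow as tf

# TF_CONFIG must be set BEFORE the strategy is created, on every worker;
# hostnames and the task index here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},   # 0 on the first worker, 1 on the second
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Tiny model only for illustration; cross-host all-reduce dominates
    # step time for models this small.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```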
1
vote
0 answers

How to apply gradient clipping in TensorFlow during distributed training?

I would like to know how to apply gradient clipping in TensorFlow during distributed training. Here's my code: @lazy_property def optimize(self): # train_vars = ... optimizer = tf.train.AdamOptimizer(self._learning_rate) …
xuanjiu
  • 11
  • 2
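A sketch in the TF1-style API the snippet uses: clipping is inserted between compute_gradients and apply_gradients with tf.clip_by_global_norm. Note that whether clipping happens before or after cross-replica gradient aggregation depends on the distribution strategy and optimizer wrapper in use, so this is only the single-optimizer pattern.

```python
# Sketch: clip-by-global-norm between compute_gradients and apply_gradients
# (TF1-style API, matching the tf.train.AdamOptimizer used in the question).
import tensorflow as tf

def optimize(loss, learning_rate, clip_norm=5.0):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```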