Questions tagged [distributed-training]
83 questions
1
vote
0 answers
[PyTorch] Error when using DistributedDataParallel in the broadcasting stage of initialization
I'm currently working on GroupFormer, which uses DistributedDataParallel for training. The error message is listed below, and it shows that the error is caused by a tensor size mismatch during the broadcast in the initialization stage.
This error first…

jasonWu
- 11
- 2
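
A minimal sketch of the usual DistributedDataParallel setup (assuming a torchrun launch; the Linear layer is a stand-in for the real model): DDP broadcasts rank 0's parameters and buffers when it is constructed, so a size mismatch at that stage usually means the ranks did not build identical models.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes the script was started with torchrun, which sets LOCAL_RANK etc.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    torch.manual_seed(0)                               # same init on every rank
    model = torch.nn.Linear(128, 64).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])        # rank 0 broadcasts params/buffers here

if __name__ == "__main__":
    main()
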
1
vote
1 answer
Distributed training on PyTorch and Spot checkpoints in SageMaker
I'm building a custom model on PyTorch and want to know how to implement snapshot logic for distributed training.
If a model is trained on multiple spot instances and the model is implemented on a BYO PyTorch image, how does SageMaker know which…

juvchan
- 6,113
- 2
- 22
- 35
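
A minimal sketch of the usual pattern (assuming the estimator was created with checkpoint_s3_uri, the default local checkpoint path /opt/ml/checkpoints, and an already-initialized process group): rank 0 writes the checkpoint locally, SageMaker syncs that directory to S3, and the same directory is restored after a spot interruption.

import os
import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/opt/ml/checkpoints"   # SageMaker's default checkpoint_local_path

def save_checkpoint(model, optimizer, epoch):
    # Only one process writes; SageMaker uploads this directory to checkpoint_s3_uri.
    if dist.get_rank() == 0:
        state = {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        torch.save(state, os.path.join(CHECKPOINT_DIR, f"checkpoint-{epoch}.pt"))
    dist.barrier()   # keep all ranks in step around the save
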
1
vote
1 answer
Add Security groups in Amazon SageMaker for distributed training jobs
We would like to enforce specific security groups on the SageMaker training jobs (XGBoost in script mode).
However, distributed training in this case won't work out of the box, since the containers need to communicate with each other.…

Philipp Schmid
- 126
- 7
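
A minimal sketch with the SageMaker Python SDK (all IDs, paths, and the role ARN are placeholders): passing subnets and security_group_ids places the training containers in your VPC; for distributed training the security group typically also needs a self-referencing inbound rule so the containers can reach each other.

from sagemaker.xgboost import XGBoost

estimator = XGBoost(
    entry_point="train.py",                                # script-mode entry point
    framework_version="1.5-1",
    instance_count=2,                                      # distributed training
    instance_type="ml.m5.xlarge",
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
    subnets=["subnet-0123456789abcdef0"],                  # placeholder
    security_group_ids=["sg-0123456789abcdef0"],           # placeholder
)
estimator.fit({"train": "s3://my-bucket/train/"})          # placeholder S3 path
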
1
vote
0 answers
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
I am trying to utilize multiple GPUs using Horovod for distributed training. Initially, I used a single GPU and then two GPUs to test a simple convolutional neural network. Everything functioned properly. Then I used a CNN and LSTM in combination. It…

Ahmad
- 645
- 2
- 6
- 21
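
For context, a minimal sketch of the GPU pinning used in Horovod's TF2 examples (an assumption about the setup, since the question's code is truncated): this error usually appears when GPUs are pinned twice, e.g. CUDA_VISIBLE_DEVICES already leaves each process with a single device and visible_device_list is then set to a local rank greater than 0.

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Each process should see all GPUs here and pin exactly one by local rank.
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
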
1
vote
1 answer
Distributed Training Terminology: Micro-batch and Per-Replica batch size
I am reading through the SageMaker documentation on distributed training and am confused by the terminology:
Mini-Batch, Micro-batch and Per-replica batch size
I understand that in data parallelism, there would be multiple copies of the model and each…

outlier229
- 481
- 1
- 7
- 18
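
Illustrative arithmetic only (not the SageMaker API), showing one common reading of the three terms: the per-replica batch is what each model copy sees per optimizer step, micro-batches are the slices that batch is cut into (for gradient accumulation or pipeline parallelism), and the mini-batch / global batch is the total across replicas.

num_replicas = 8                 # model copies, one per GPU
per_replica_batch_size = 16      # samples each replica consumes per optimizer step
micro_batch_size = 4             # slice processed per forward/backward pass

accumulation_steps = per_replica_batch_size // micro_batch_size   # 4 micro-batches per step
global_batch_size = per_replica_batch_size * num_replicas         # 128-sample mini-batch

print(accumulation_steps, global_batch_size)
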
1
vote
0 answers
keras.models.load_model does not work as expected within MirroredStrategy
I am trying to load a model within MirroredStrategy. I find that the loaded model within MirroredStrategy does not work correctly, in that only one replica is found even though 4 visible devices are actually specified. This does not happen for the model…

Kingsley Liu
- 11
- 1
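
A minimal sketch (the model path is a placeholder): loading, like building, should happen inside strategy.scope() so the restored variables are mirrored onto every replica rather than placed on a single device.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.models.load_model("saved_model_dir")   # placeholder path
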
1
vote
1 answer
SageMaker Notebook instance error: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
I have an active Dask cluster:
from dask.distributed import Client, progress
client = Client()
client
When I try to encode my data I get the error:
AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
I encoded the data…

Alejandro
- 119
- 7
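
A minimal sketch of a first sanity check (an assumption, not a confirmed fix): this AttributeError is commonly reported when the dask and distributed packages are on mismatched versions, so printing both and keeping them on matching releases is the usual first step.

import dask
import distributed
from dask.distributed import Client

print("dask:", dask.__version__, "| distributed:", distributed.__version__)
client = Client()                    # local cluster, as in the question
client.get_versions(check=True)      # also raises if client/scheduler/worker versions disagree
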
1
vote
1 answer
ROLLBACK_IN_PROGRESS status after creating dask-fargate-stack on AWS CloudFormation
I am following this guide to use Dask distributed on my SageMaker instance so I can train my big-data regression model, but when I create the stack, it goes into ROLLBACK_IN_PROGRESS status.
How can I manually create the stack for dask…

Alejandro
- 119
- 7
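
A minimal sketch for finding out why the stack rolled back (the stack name is a placeholder): the first CREATE_FAILED event carries the failing resource and the reason, which is usually more useful than recreating the stack by hand.

import boto3

cfn = boto3.client("cloudformation")
events = cfn.describe_stack_events(StackName="dask-fargate-stack")["StackEvents"]
for event in reversed(events):       # events come newest-first, so walk oldest-first
    if event["ResourceStatus"] == "CREATE_FAILED":
        print(event["LogicalResourceId"], event.get("ResourceStatusReason"))
        break
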
1
vote
0 answers
Distributed training with TensorFlow on 'x' GPUs makes the loss 1/x
I was trying to run a model on multiple GPUs with TensorFlow's mirrored strategy.
I used a custom loss function like this:
def mae(y_true, y_pred):
    # y_true, y_pred shape = (B, L)
    loss = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
    …

steinum
- 76
- 3
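
A minimal sketch for a custom training loop (an assumption about the cause: the loss ends up divided both per replica and across replicas, shrinking it by the replica count): the recommended pattern is to scale the per-example loss by the global batch size with tf.nn.compute_average_loss and let the cross-replica reduction be a SUM.

import tensorflow as tf

def mae_loss(y_true, y_pred, global_batch_size):
    # Per-example loss, shape (per_replica_batch,)
    per_example = tf.keras.losses.mean_absolute_error(y_true, y_pred)
    # Divide by the *global* batch size so summing across replicas gives the true mean.
    return tf.nn.compute_average_loss(per_example, global_batch_size=global_batch_size)
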
1
vote
0 answers
How does TensorFlow (2.0) distributed dataset manage data?
I'm a newbie to TensorFlow. I have been learning how to use TensorFlow to train models in a distributed manner, and I have access to multiple servers, each with multiple CPUs.
Training mechanisms are clearly outlined in documentation and tutorials,…

smjfas
- 43
- 1
- 1
- 6
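
A minimal sketch of the usual data path (synthetic data, so the numbers are placeholders): experimental_distribute_dataset takes a dataset batched by the global batch size, splits each batch across the replicas, and in multi-worker strategies also auto-shards the input across workers.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # or MultiWorkerMirroredStrategy with TF_CONFIG set
GLOBAL_BATCH_SIZE = 64

features = tf.random.normal([1024, 10])
labels = tf.random.normal([1024, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    # Each element is a PerReplica value holding one slice per replica.
    break
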
1
vote
1 answer
DistributedDataParallel with gpu device ID specified in PyTorch
I want to train my model through DistributedDataParallel on a single machine that has 8 GPUs. But I want to train my model on four specified GPUs with device IDs 4, 5, 6, 7.
How do I specify the GPU device IDs for DistributedDataParallel?
I think the…

Bipin
- 53
- 1
- 8
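
A minimal sketch of one common approach (an assumption, since the question is truncated): expose only the four GPUs to the job with CUDA_VISIBLE_DEVICES, so local ranks 0-3 map onto physical devices 4-7, e.g. CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node=4 train.py.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # 0..3, remapped onto physical GPUs 4..7
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)   # stand-in model
model = DDP(model, device_ids=[local_rank])
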
1
vote
1 answer
How to know how many GPUs are used in PyTorch?
The bash file I used to launch the training looks like this:
CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch \
    --nproc_per_node=2 train.py \
    --batch_size 6 \
    --other_args
I found that the batch size of tensors on each GPU is actually…

zheyuanWang
- 1,158
- 2
- 16
- 30
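
A minimal sketch of what each launched process can report (assuming the same launch line): torch.distributed.launch starts one process per GPU, and in most training scripts --batch_size is per process, so two processes each loading 6 samples means an effective global batch of 12 rather than 6 split in half.

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
print("GPUs visible to this process:", torch.cuda.device_count())   # 2 (devices 3 and 4)
print("world size (processes):", dist.get_world_size())             # 2
print("this process's rank:", dist.get_rank())
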
1
vote
0 answers
Training spaCy NER models on multiple GPUs (not just one)
I am training my NER model using the following code.
Start of Code:
def train_spacy(nlp, training_data, iterations):
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe("ner", last = True)
    …

Julia Penfield
- 53
- 7
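
For context, a minimal sketch of GPU selection in spaCy (not a multi-GPU solution: a single spaCy training loop runs on one GPU, so using several GPUs usually means running separate processes, each pinned to its own device and its own data shard).

import spacy

spacy.require_gpu(0)                 # pick the GPU for *this* process (needs cupy installed)
nlp = spacy.blank("en")
if "ner" not in nlp.pipe_names:
    nlp.add_pipe("ner", last=True)   # spaCy v3 style: add by component name
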
1
vote
0 answers
How to speed up TF model training? MultiWorkerMirroredStrategy looks a lot slower than non-distributed
Using the code in the Keras distributed training example; using TF 2.4.1.
Following other docs:
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy
On a single…

Dmitry Goldenberg
- 197
- 3
- 7
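
A minimal sketch of the multi-worker setup (hostnames and sizes are placeholders): every worker needs a TF_CONFIG describing the full cluster, and the global batch size is normally scaled by the number of workers; without that scaling, each step adds cross-worker communication while doing the same amount of work, which can easily look slower than single-machine training.

import json, os
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},   # placeholder hosts
    "task": {"type": "worker", "index": 0},                  # index differs per machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
per_worker_batch_size = 64
global_batch_size = per_worker_batch_size * strategy.num_replicas_in_sync
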
1
vote
0 answers
How to apply gradient clipping in TensorFlow during distributed training?
I would like to know how to apply gradient clipping in TensorFlow during distributed training.
Here's my code:
@lazy_property
def optimize(self):
    # train_vars = ...
    optimizer = tf.train.AdamOptimizer(self._learning_rate)
    …

xuanjiu
- 11
- 2
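
A minimal TF1-style sketch matching the question's AdamOptimizer (an assumption about the rest of the setup): clip by global norm between compute_gradients and apply_gradients; the same pattern applies when the optimizer is wrapped by a distributed one such as Horovod's DistributedOptimizer.

import tensorflow as tf

def build_train_op(loss, learning_rate, clip_norm=5.0):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped_grads, variables))
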