Questions tagged [distributed-training]

83 questions
1
vote
0 answers

[PyTorch] Error when using DistributedDataParallel in the broadcasting stage of initialization

I'm currently working on GroupFormer, which uses DistributedDataParallel for training. The error message is listed below; it shows that the error is caused by a tensor size mismatch while broadcasting in the initialization stage. This error first…
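This kind of broadcast size mismatch usually means the ranks did not build identical parameter shapes before wrapping the model, since DDP broadcasts rank 0's parameters to every other rank at construction time. A minimal sketch of a consistent single-node setup, assuming a torchrun-style launcher that sets LOCAL_RANK (the model factory here is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model_fn):
    # Every rank must construct the model with identical architecture and
    # parameter shapes *before* wrapping; DDP broadcasts rank 0's weights
    # to the other ranks during __init__.
    dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from the env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model_fn().cuda(local_rank)          # same config => same shapes on every rank
    return DDP(model, device_ids=[local_rank])
```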
1
vote
1 answer

Distributed training on PyTorch and Spot checkpoints in SageMaker

I'm building a custom model on PyTorch and want to know how to implement snapshot logic for distributed training. If a model is trained on multiple spot instances and implemented on a BYO PyTorch image, how does SageMaker know which…
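For reference, the usual pattern is a sketch like the one below (image URI, role, and bucket are placeholders): point the Estimator at an S3 checkpoint location and have the training script save to and resume from the local checkpoint path; SageMaker syncs that directory to S3 and restores it when a spot instance is replaced.

```python
# Sketch of a SageMaker Estimator configured for spot training with
# checkpoint sync to S3 (paths, role, and image URI are placeholders).
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-byo-pytorch-image>",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,
    max_run=3600,
    max_wait=7200,                                # must be >= max_run for spot training
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",  # the training script saves/loads here
)
```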
1
vote
1 answer

Add Security groups in Amazon SageMaker for distributed training jobs

We would like to enforce specific security groups to be set on the SageMaker training jobs (XGBoost in script mode). However, distributed training, in this case, won’t work out of the box, since the containers need to communicate with each other.…
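A rough sketch of attaching VPC networking to a script-mode XGBoost estimator, with placeholder subnet and security-group IDs; for distributed training the security group generally needs a self-referencing inbound rule so the training containers can reach each other:

```python
# Sketch: run a script-mode XGBoost training job inside a VPC
# (subnet and security-group identifiers are placeholders).
from sagemaker.xgboost import XGBoost

estimator = XGBoost(
    entry_point="train.py",
    framework_version="1.5-1",
    role="<execution-role-arn>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],  # add a self-referencing rule so the
                                                  # containers can talk to each other
)
```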
1
vote
0 answers

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

I am trying to utilize multiple GPUs using Horovod for distributed training. Initially, I used a single GPU and then two GPUs to test a simple convolutional neural network. Everything functions properly. Then, I used a CNN and LSTM in combination. It…
Ahmad
  • 645
  • 2
  • 6
  • 21
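For context, the typical Horovod pinning pattern for TF2 maps each process to the GPU matching its local rank; if CUDA_VISIBLE_DEVICES has already hidden devices, local_rank() can point at a GPU id that no longer exists, which matches the error above. A sketch:

```python
# Typical Horovod GPU pinning for TF2: each process sees all GPUs and selects
# the one matching its local rank.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
```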
1
vote
1 answer

Distributed Training Terminology: Micro-batch and Per-Replica batch size

I am reading through the SageMaker documentation on distributed training and am confused by the terminology: mini-batch, micro-batch, and per-replica batch size. I understand that in data parallelism there would be multiple copies of the model, and each…
outlier229
  • 481
  • 1
  • 7
  • 18
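A rough arithmetic sketch of how the three terms usually relate (the numbers are purely illustrative): the mini-batch is the global batch consumed per optimizer step, the per-replica batch is that mini-batch divided across model copies, and the micro-batch is a further split of the per-replica batch used for pipelining or gradient accumulation.

```python
# Illustrative numbers only: how mini-batch, per-replica batch, and
# micro-batch typically relate in data-parallel training.
global_mini_batch = 256    # samples consumed per optimizer step, across all replicas
num_replicas      = 8      # model copies (data parallelism)
per_replica_batch = global_mini_batch // num_replicas       # 32 samples per replica per step

num_microbatches  = 4      # split of each replica's batch (pipelining / grad accumulation)
micro_batch       = per_replica_batch // num_microbatches   # 8 samples per micro-batch
```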
1
vote
0 answers

keras.models.load_model does not work as expected within MirroredStrategy

I am trying to load a model within MirroredStrategy. I find that the loaded model within MirroredStrategy is not working correctly, in that only one replica is found, while there are actually 4 visible devices specified. This does not happen for the model…
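A minimal sketch of the usual pattern, assuming a placeholder SavedModel path: variables have to be created inside strategy.scope() to become mirrored variables, so the load (or at least the model construction) normally happens within the scope.

```python
# Sketch: load a saved Keras model inside the strategy scope so its variables
# are created as mirrored variables across all visible GPUs.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # should report 4 replicas for 4 visible GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.models.load_model("<path/to/saved_model>")  # placeholder path
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```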
1
vote
1 answer

SageMaker Notebook instance error: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'

I have a Dask cluster active (from dask.distributed import Client, progress; client = Client(); client). When I try to encode my data I get the error AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'. I encoded the data…
1
vote
1 answer

ROLLBACK_IN_PROGRESS status after creating dask-fargate-stack on AWS CloudFormation

I am following this guide to be able to use Dask distributed on my SageMaker instance, so I can train my big-data regression model, but when I create the stack I get the status ROLLBACK_IN_PROGRESS. How can I manually create the stack for dask…
1
vote
0 answers

Distributed training with TensorFlow on 'x' GPUs makes loss 1/x

I was trying to run a model on multiple GPUs with the mirrored strategy of TensorFlow. I used a custom loss function like this: def mae(y_true, y_pred): # y_true, y_pred shape = (B, L) loss = tf.keras.metrics.mean_absolute_error(y_true, y_pred) …
steinum
  • 76
  • 3
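A sketch of the common fix for a custom training loop under MirroredStrategy (GLOBAL_BATCH_SIZE is illustrative): per-example losses should be averaged over the global batch size rather than the per-replica batch, otherwise the effective loss shrinks roughly by 1/num_replicas as replicas are added.

```python
# Sketch: sum per-example losses and divide by the GLOBAL batch size when
# training under MirroredStrategy with a custom loop.
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # illustrative

def mae(y_true, y_pred):
    per_example = tf.keras.metrics.mean_absolute_error(y_true, y_pred)  # shape (per_replica_batch,)
    return tf.nn.compute_average_loss(per_example, global_batch_size=GLOBAL_BATCH_SIZE)
```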
1
vote
0 answers

How does a TensorFlow (2.0) distributed dataset manage data?

I'm a newbie to TensorFlow. I have been learning how to use TensorFlow to train models in a distributed manner, and I have access to multiple servers, each with multiple CPUs. The training mechanisms are clearly outlined in the documentation and tutorials,…
smjfas
  • 43
  • 1
  • 1
  • 6
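A minimal sketch, with toy in-memory data standing in for a real input pipeline, of how a tf.data.Dataset is handed to a strategy: the strategy splits each global batch into per-replica batches, and multi-worker strategies additionally auto-shard the dataset across workers by default.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 16 * strategy.num_replicas_in_sync

# Toy in-memory data standing in for the real input pipeline.
features = tf.random.normal([256, 8])
labels = tf.random.uniform([256], maxval=2, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(256)
           .batch(GLOBAL_BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))

# The strategy splits each global batch into per-replica batches; for
# multi-worker strategies the dataset is also auto-sharded across workers.
dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    pass  # each `batch` holds the per-replica slices, normally consumed via strategy.run(...)
```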
1
vote
1 answer

DistributedDataParallel with gpu device ID specified in PyTorch

I want to train my model with DistributedDataParallel on a single machine that has 8 GPUs, but only on four specific GPUs, with device IDs 4, 5, 6, 7. How do I specify the GPU device IDs for DistributedDataParallel? I think the…
Bipin
  • 53
  • 1
  • 8
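A common approach (a sketch, with a toy linear layer standing in for the real model) is to expose only the wanted devices via CUDA_VISIBLE_DEVICES so they are renumbered 0-3 inside the processes, then bind each rank to its local device:

```python
# Launch so that only GPUs 4-7 are visible (they appear as cuda:0..cuda:3 in-process):
#   CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])       # 0..3, mapping onto physical GPUs 4..7
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # toy model for illustration
ddp_model = DDP(model, device_ids=[local_rank])
```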
1
vote
1 answer

How to know how many GPUs are used in PyTorch?

The bash file I used to launch the training looks like this: CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch \ --nproc_per_node=2 train.py \ --batch_size 6 \ --other_args I found that the batch size of tensors in each GPU is actually…
zheyuanWang
  • 1,158
  • 2
  • 16
  • 30
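A quick diagnostic sketch that could be dropped into train.py, assuming a launcher that sets the LOCAL_RANK environment variable (recent torch.distributed.launch/torchrun versions do): it prints how many GPUs each process sees, which one it uses, and the world size.

```python
# Quick check inside train.py: how many GPUs this process can see, which one
# it is actually using, and how many processes participate overall.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()} / world_size {dist.get_world_size()}: "
      f"visible GPUs = {torch.cuda.device_count()}, "
      f"using = {torch.cuda.current_device()} ({torch.cuda.get_device_name(local_rank)})")
```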
1
vote
0 answers

Training spaCy NER models on multiple GPUs (not just one)

I am training my NER model using the following code. Start of Code: def train_spacy(nlp, training_data, iterations): if "ner" not in nlp.pipe_names: ner = nlp.create_pipe('ner') nlp.add_pipe("ner", last = True) …
1
vote
0 answers

How to speed up TF model training? MultiWorkerMirroredStrategy looks a lot slower than non-distributed

I'm using the code in the Keras distributed training example with TF 2.4.1, following other docs: https://www.tensorflow.org/guide/distributed_training https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy On a single…
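For reference, a minimal sketch of the multi-worker setup (placeholder hostnames; the task index differs per worker): TF_CONFIG has to be exported on every worker before the strategy is created, and every step then pays a cross-host gradient all-reduce, which is why small models often train slower than the single-machine baseline.

```python
import json
import os
import tensorflow as tf

# TF_CONFIG must be set BEFORE the strategy is created, on every worker;
# hostnames and the task index here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},   # 0 on the first worker, 1 on the second
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Tiny model only for illustration; cross-host all-reduce dominates
    # step time for models this small.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```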
1
vote
0 answers

How to apply gradient clipping in TensorFlow during distributed training?

I would like to know how to apply gradient clipping in TensorFlow during distributed training. Here's my code: @lazy_property def optimize(self): # train_vars = ... optimizer = tf.train.AdamOptimizer(self._learning_rate) …
xuanjiu
  • 11
  • 2
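A sketch in the TF1-style API the snippet uses: clipping is inserted between compute_gradients and apply_gradients with tf.clip_by_global_norm. Note that whether clipping happens before or after cross-replica gradient aggregation depends on the distribution strategy and optimizer wrapper in use, so this is only the single-optimizer pattern.

```python
# Sketch: clip-by-global-norm between compute_gradients and apply_gradients
# (TF1-style API, matching the tf.train.AdamOptimizer used in the question).
import tensorflow as tf

def optimize(loss, learning_rate, clip_norm=5.0):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```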