Questions tagged [distributed-training]
83 questions
0 votes, 2 answers
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [640]] is at version 4;
I want to use PyTorch DistributedDataParallel for adversarial training. The loss function is TRADES. The code runs in DataParallel mode, but in DistributedDataParallel mode I get this error.
When I change the loss to AT, it runs successfully. …

shudong
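Since this error typically surfaces when an adversarial loss such as TRADES runs the model forward more than once under DDP, below is a minimal sketch of a torchrun-launched DistributedDataParallel setup with autograd anomaly detection enabled to locate the offending in-place operation. The tiny Sequential model, tensor shapes, and the broadcast_buffers=False workaround are assumptions for illustration, not the asker's code or a guaranteed fix.

# Launch with: torchrun --nproc_per_node=N ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK and the MASTER_* variables.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; avoid in-place ops (e.g. nn.ReLU(inplace=True)) in the
    # real network, since they are a common source of this error.
    model = torch.nn.Sequential(
        torch.nn.Linear(640, 640),
        torch.nn.ReLU(),
        torch.nn.Linear(640, 10),
    ).cuda(local_rank)

    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        # Adversarial losses run two forward passes; skipping the per-step buffer
        # broadcast sometimes avoids the version bump on BatchNorm statistics
        # that triggers "is at version N".
        broadcast_buffers=False,
    )

    # Temporarily enable anomaly detection to get a stack trace that points at
    # the exact in-place operation behind the error message.
    torch.autograd.set_detect_anomaly(True)

    x = torch.randn(8, 640, device=local_rank)
    y = torch.randint(0, 10, (8,), device=local_rank)
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()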
0 votes, 1 answer
How to use model subclassing in Keras?
Given the following model written in the Sequential API:
config = {
    'learning_rate': 0.001,
    'lstm_neurons': 32,
    'lstm_activation': 'tanh',
    'dropout_rate': 0.08,
    'batch_size': 128,
    'dense_layers': [
        {'neurons': 32, …

Shlomi Schwartz
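Below is a sketch of how the same configuration could be expressed with Keras model subclassing; because the config in the question is truncated, the output head, the dense-layer activation, the loss, and the input shape are assumptions.

import tensorflow as tf

class LSTMModel(tf.keras.Model):
    """Subclassed equivalent of the Sequential LSTM model described by the config."""

    def __init__(self, config):
        super().__init__()
        self.lstm = tf.keras.layers.LSTM(
            config["lstm_neurons"], activation=config["lstm_activation"]
        )
        self.dropout = tf.keras.layers.Dropout(config["dropout_rate"])
        # Activation assumed, since the dense-layer entries are truncated.
        self.hidden = [
            tf.keras.layers.Dense(d["neurons"], activation="relu")
            for d in config["dense_layers"]
        ]
        self.out = tf.keras.layers.Dense(1)  # assumed single-value output

    def call(self, inputs, training=False):
        x = self.lstm(inputs)
        x = self.dropout(x, training=training)
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)

config = {
    "learning_rate": 0.001,
    "lstm_neurons": 32,
    "lstm_activation": "tanh",
    "dropout_rate": 0.08,
    "batch_size": 128,
    "dense_layers": [{"neurons": 32}],
}
model = LSTMModel(config)
model.compile(
    optimizer=tf.keras.optimizers.Adam(config["learning_rate"]), loss="mse"
)
# model.fit(x_train, y_train, batch_size=config["batch_size"]) once data with
# shape (samples, timesteps, features) is available.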
0 votes, 1 answer
Dynamic PS-Worker Scheme Cannot Share Parameters in Cluster Propagation Mode
I'm trying to build a scalable distributed training system with a PS-worker scheme. In this scheme, every PS knows about all of the PSs, and the number of PSs stays constant. Each worker, on the other hand, knows only about itself and all of the PSs.
Using the…

RBTOppenheimer
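A minimal TF1-style sketch of the scheme the question describes follows: each worker builds a ClusterSpec listing all of the (fixed) parameter servers but only itself as a worker, and replica_device_setter places variables on the PS tasks so that workers defining the same variable names share parameters. The host names, ports, and the toy variable are placeholders, not the asker's setup.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Assumed, fixed PS addresses known to every process, plus this worker's own address.
ps_hosts = ["ps0.example.com:2222", "ps1.example.com:2222"]
my_host = "worker3.example.com:2222"

# Each worker lists all PSs but only itself as a worker, so new workers can join
# without the PSs or the other workers being reconfigured.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": [my_host]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to the PS tasks (round-robin). Because the
# PSs hold the variables by name, every worker that defines `w` like this reads
# and updates the same shared parameters.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    w = tf.get_variable("w", shape=[10, 1], initializer=tf.zeros_initializer())

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(w).shape)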
0 votes, 1 answer
Does `tf.distribute.MirroredStrategy` have an impact on training outcome?
I don't understand whether MirroredStrategy has any impact on the training outcome.
By that I mean: is a model trained on a single device the same as a model trained on multiple devices?
I think it should be the same model, because it's just a…

Domi W
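For context, MirroredStrategy keeps one copy of the model per device, feeds each replica a different slice of the global batch, and aggregates the gradients before applying them, so the trained model should be essentially the same as a single-device run with the same global batch size; differences usually come from the effective batch size, random seeds, and non-deterministic ops rather than from the strategy itself. A minimal usage sketch, in which the model, data, and batch size are placeholders:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables and the optimizer must be created inside the strategy scope so they
# are mirrored onto every device.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; batch_size is the *global* batch size, which Keras splits
# across the replicas.
x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1, verbose=0)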
0 votes, 1 answer
How are you getting trained in light of tech conferences getting cancelled?
I'm trying to figure out how to keep the software engineers at my company trained. How are you getting trained in light of working from home and/or tech conferences being cancelled for the foreseeable future?

Jeff Hansen
-1 votes, 2 answers
Can SageMaker distributed training be used for training non-deep learning models?
I am following this documentation page to understand SageMaker's distributed training feature.
It says here that:
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow,…

juvchan
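For context, the distributed training libraries are switched on through the distribution argument of the framework estimators that run the AWS deep learning containers, which is what the quoted sentence refers to. A hedged sketch of enabling the data-parallel library on a PyTorch estimator follows; the entry-point script, role ARN, instance type, and version strings are placeholder assumptions, not values from the question.

from sagemaker.pytorch import PyTorch

# The `distribution` argument is what enables SageMaker's data-parallel library,
# and it is honoured by the framework estimators (PyTorch, TensorFlow) that run
# the deep learning containers, not by the built-in algorithms.
estimator = PyTorch(
    entry_point="train.py",                              # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole", # placeholder IAM role
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.10",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
# estimator.fit({"training": "s3://my-bucket/train/"})   # placeholder S3 input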
-1 votes, 1 answer
Asynchronous Training with Ray
I want to dispatch a large number of data-collection tasks to some Ray workers while a trainer runs concurrently and asynchronously on another CPU, training on the collected data. The idea resembles this example from the docs: …

Gabizon
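A minimal sketch of that pattern with plain Ray tasks follows: the collection tasks run on Ray workers while the driver-side trainer uses ray.wait to train on whichever batch finishes first instead of blocking on a whole round. The collect_data and train_on functions here are stand-ins, not the asker's code.

import random
import time

import ray

ray.init()

@ray.remote
def collect_data(task_id):
    # Stand-in for an expensive environment rollout / data-collection job.
    time.sleep(random.random())
    return [random.random() for _ in range(100)]

def train_on(batch):
    # Stand-in for one training step on the collected batch.
    return sum(batch) / len(batch)

# Launch many collection tasks up front; they run in parallel on the Ray workers.
pending = [collect_data.remote(i) for i in range(20)]

# The trainer consumes whichever batch is ready first, so training proceeds
# asynchronously instead of waiting for every collection task to finish.
while pending:
    ready, pending = ray.wait(pending, num_returns=1)
    batch = ray.get(ready[0])
    loss = train_on(batch)
    print("trained on one batch, proxy loss:", loss)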