Questions tagged [distributed-training]
83 questions
0 votes, 2 answers
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [640]] is at version 4;
I want to use PyTorch DistributedDataParallel for adversarial training. The loss function is TRADES. The code runs in DataParallel mode, but in DistributedDataParallel mode I get this error.
When I change the loss to AT, it runs successfully. …

shudong
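Since this error typically surfaces when an adversarial loss such as TRADES runs the model forward more than once under DDP, below is a minimal sketch of a torchrun-launched DistributedDataParallel setup with autograd anomaly detection enabled to locate the offending in-place operation. The tiny Sequential model, tensor shapes, and the broadcast_buffers=False workaround are assumptions for illustration, not the asker's code or a guaranteed fix.

# Launch with: torchrun --nproc_per_node=N ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK and the MASTER_* variables.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; avoid in-place ops (e.g. nn.ReLU(inplace=True)) in the
    # real network, since they are a common source of this error.
    model = torch.nn.Sequential(
        torch.nn.Linear(640, 640),
        torch.nn.ReLU(),
        torch.nn.Linear(640, 10),
    ).cuda(local_rank)

    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        # Adversarial losses run two forward passes; skipping the per-step buffer
        # broadcast sometimes avoids the version bump on BatchNorm statistics
        # that triggers "is at version N".
        broadcast_buffers=False,
    )

    # Temporarily enable anomaly detection to get a stack trace that points at
    # the exact in-place operation behind the error message.
    torch.autograd.set_detect_anomaly(True)

    x = torch.randn(8, 640, device=local_rank)
    y = torch.randint(0, 10, (8,), device=local_rank)
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()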
0 votes, 1 answer
How to use model subclassing in Keras?
Given the following model written in the Sequential API:
config = {
    'learning_rate': 0.001,
    'lstm_neurons': 32,
    'lstm_activation': 'tanh',
    'dropout_rate': 0.08,
    'batch_size': 128,
    'dense_layers': [
        {'neurons': 32, …

Shlomi Schwartz
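Below is a sketch of how the same configuration could be expressed with Keras model subclassing; because the config in the question is truncated, the output head, the dense-layer activation, the loss, and the input shape are assumptions.

import tensorflow as tf

class LSTMModel(tf.keras.Model):
    """Subclassed equivalent of the Sequential LSTM model described by the config."""

    def __init__(self, config):
        super().__init__()
        self.lstm = tf.keras.layers.LSTM(
            config["lstm_neurons"], activation=config["lstm_activation"]
        )
        self.dropout = tf.keras.layers.Dropout(config["dropout_rate"])
        # Activation assumed, since the dense-layer entries are truncated.
        self.hidden = [
            tf.keras.layers.Dense(d["neurons"], activation="relu")
            for d in config["dense_layers"]
        ]
        self.out = tf.keras.layers.Dense(1)  # assumed single-value output

    def call(self, inputs, training=False):
        x = self.lstm(inputs)
        x = self.dropout(x, training=training)
        for layer in self.hidden:
            x = layer(x)
        return self.out(x)

config = {
    "learning_rate": 0.001,
    "lstm_neurons": 32,
    "lstm_activation": "tanh",
    "dropout_rate": 0.08,
    "batch_size": 128,
    "dense_layers": [{"neurons": 32}],
}
model = LSTMModel(config)
model.compile(
    optimizer=tf.keras.optimizers.Adam(config["learning_rate"]), loss="mse"
)
# model.fit(x_train, y_train, batch_size=config["batch_size"]) once data with
# shape (samples, timesteps, features) is available.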
0 votes, 1 answer
Dynamic PS-Worker Scheme Cannot Share Parameters in Cluster Propagation Mode
I'm trying to build a scalable distributed training system with a PS-worker scheme. In this scheme, every PS knows about all of the PSs, and the number of PSs stays constant. Each worker, on the other hand, knows only about itself and all of the PSs.
Using the…

RBTOppenheimer
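A minimal TF1-style sketch of the scheme the question describes follows: each worker builds a ClusterSpec listing all of the (fixed) parameter servers but only itself as a worker, and replica_device_setter places variables on the PS tasks so that workers defining the same variable names share parameters. The host names, ports, and the toy variable are placeholders, not the asker's setup.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Assumed, fixed PS addresses known to every process, plus this worker's own address.
ps_hosts = ["ps0.example.com:2222", "ps1.example.com:2222"]
my_host = "worker3.example.com:2222"

# Each worker lists all PSs but only itself as a worker, so new workers can join
# without the PSs or the other workers being reconfigured.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": [my_host]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to the PS tasks (round-robin). Because the
# PSs hold the variables by name, every worker that defines `w` like this reads
# and updates the same shared parameters.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    w = tf.get_variable("w", shape=[10, 1], initializer=tf.zeros_initializer())

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(w).shape)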
0 votes, 1 answer
Does `tf.distribute.MirroredStrategy` have an impact on training outcome?
I don't understand whether MirroredStrategy has any impact on the training outcome.
By that I mean: is a model trained on a single device the same as a model trained on multiple devices?
I think it should be the same model, because it's just a…

Domi W
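For context, MirroredStrategy keeps one copy of the model per device, feeds each replica a different slice of the global batch, and aggregates the gradients before applying them, so the trained model should be essentially the same as a single-device run with the same global batch size; differences usually come from the effective batch size, random seeds, and non-deterministic ops rather than from the strategy itself. A minimal usage sketch, in which the model, data, and batch size are placeholders:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables and the optimizer must be created inside the strategy scope so they
# are mirrored onto every device.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; batch_size is the *global* batch size, which Keras splits
# across the replicas.
x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1, verbose=0)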
0 votes, 1 answer
How are you getting trained in light of tech conferences getting cancelled?
I'm trying to figure out how to keep the software engineers at my company trained. How are you getting trained in light of working from home and/or tech conferences being cancelled for the foreseeable future?

Jeff Hansen
-1 votes, 2 answers
Can SageMaker distributed training be used for training non-deep learning models?
I am following this documentation page to understand SageMaker's distributed training feature.
It says here that:
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow,…

juvchan
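For context, the distributed training libraries are switched on through the distribution argument of the framework estimators that run the AWS deep learning containers, which is what the quoted sentence refers to. A hedged sketch of enabling the data-parallel library on a PyTorch estimator follows; the entry-point script, role ARN, instance type, and version strings are placeholder assumptions, not values from the question.

from sagemaker.pytorch import PyTorch

# The `distribution` argument is what enables SageMaker's data-parallel library,
# and it is honoured by the framework estimators (PyTorch, TensorFlow) that run
# the deep learning containers, not by the built-in algorithms.
estimator = PyTorch(
    entry_point="train.py",                              # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole", # placeholder IAM role
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.10",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
# estimator.fit({"training": "s3://my-bucket/train/"})   # placeholder S3 input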
-1 votes, 1 answer
Asynchronous Training with Ray
I want to dispatch a large number of data-collection tasks to some Ray workers while a trainer runs concurrently and asynchronously on another CPU, training on the collected data. The idea resembles this example from the docs: …

Gabizon
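A minimal sketch of that pattern with plain Ray tasks follows: the collection tasks run on Ray workers while the driver-side trainer uses ray.wait to train on whichever batch finishes first instead of blocking on a whole round. The collect_data and train_on functions here are stand-ins, not the asker's code.

import random
import time

import ray

ray.init()

@ray.remote
def collect_data(task_id):
    # Stand-in for an expensive environment rollout / data-collection job.
    time.sleep(random.random())
    return [random.random() for _ in range(100)]

def train_on(batch):
    # Stand-in for one training step on the collected batch.
    return sum(batch) / len(batch)

# Launch many collection tasks up front; they run in parallel on the Ray workers.
pending = [collect_data.remote(i) for i in range(20)]

# The trainer consumes whichever batch is ready first, so training proceeds
# asynchronously instead of waiting for every collection task to finish.
while pending:
    ready, pending = ray.wait(pending, num_returns=1)
    batch = ray.get(ready[0])
    loss = train_on(batch)
    print("trained on one batch, proxy loss:", loss)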