
I am training the same model on two different machines, but the trained models are not identical. I have taken the following measures to ensure reproducibility:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

# seed the random number generators
random.seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
# configure cuDNN for determinism
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
# run data loading in the main process (no worker threads)
DataLoader(dataset, num_workers=0)

When I train the same model multiple times on the same machine, the trained model is always the same. However, the trained models on two different machines are not the same. Is this normal? Are there any other tricks I can employ?

    What layers do you use in your model? Besides the relevant answer of @iacob, specific layers may operate in a non-deterministic way. – Shir May 13 '21 at 08:16

1 Answer


There are a number of areas that could additionally introduce randomness, e.g.:

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):
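For example (a minimal sketch; the seed value 0 is arbitrary):

import torch

# seed the RNG for all devices (both CPU and CUDA)
torch.manual_seed(0)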

CUDA convolution determinism

While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
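In code, that corresponds to something like the following (you already set the cuDNN flag; the broader switch is shown for comparison):

import torch

# make cuDNN convolution algorithms deterministic (affects convolutions only)
torch.backends.cudnn.deterministic = True

# or: ask PyTorch to use deterministic implementations everywhere,
# raising an error for operations that have no deterministic variant
torch.use_deterministic_algorithms(True)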

CUDA RNN and LSTM

In some versions of CUDA, RNNs and LSTM networks may have non-deterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.
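As a hedged sketch of the workarounds those docs describe: on newer CUDA versions they suggest constraining the cuBLAS workspace via an environment variable, set before any CUDA work happens; on older CUDA versions they suggest CUDA_LAUNCH_BLOCKING=1. Check the docs for your CUDA version before relying on the exact values below.

import os

# must be set before the first CUDA call (e.g. at the very top of the script)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"   # value suggested in the docs; may limit performance
# on some older CUDA versions the docs suggest this instead:
# os.environ["CUDA_LAUNCH_BLOCKING"] = "1"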

DataLoader

DataLoader will reseed workers following the "Randomness in multi-process data loading" algorithm. Use worker_init_fn() to preserve reproducibility:
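A sketch along the lines of the PyTorch reproducibility notes (with num_workers=0, as in your snippet, this is moot, but it matters once you add workers; dataset and batch_size are placeholders):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # derive each worker's seed from the main process seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(
    dataset,          # placeholder for your Dataset instance
    batch_size=16,    # placeholder value
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,
)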
