I have a next-step prediction model for time series, which is simply a GRU with a fully-connected layer on top of it. When I train it on the CPU, I get a loss of 0.10 after 50 epochs, but when I train it on the GPU, the loss is 0.15 after 50 epochs. Running more epochs doesn't really lower the loss in either case.
Why does training on the CPU give better performance than training on the GPU?
I have tried changing the random seeds for both the data and the model, and the results are independent of the seeds.
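This is roughly how I fix the seeds before each run (a minimal sketch; the seed value and the function name are just placeholders, and I use the same seed for the CPU run and the GPU run):

```python
import random
import numpy as np
import torch

def set_seed(seed=42):  # 42 is a placeholder value
    # Seed every source of randomness I'm aware of
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```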
I have:
Python 3.6.2
PyTorch 0.3.0
cuDNN 7.0.5
Edit:
I also use PyTorch's weight normalization (torch.nn.utils.weight_norm) on the GRU and on the fully-connected layer.
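For reference, the model looks roughly like this (a minimal sketch; the class name and the input/hidden sizes are placeholders, not my actual configuration):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

class NextStepGRU(nn.Module):
    def __init__(self, input_size=1, hidden_size=64):  # sizes are placeholders
        super(NextStepGRU, self).__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, input_size)
        # weight_norm on the GRU's input-to-hidden and hidden-to-hidden weight matrices
        self.gru = weight_norm(self.gru, name='weight_ih_l0')
        self.gru = weight_norm(self.gru, name='weight_hh_l0')
        # weight_norm on the fully-connected output layer
        self.fc = weight_norm(self.fc, name='weight')

    def forward(self, x, h0=None):
        out, h = self.gru(x, h0)          # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :]), h  # predict the next step from the last output
```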