I have a next-step prediction model for time series, which is simply a GRU with a fully-connected layer on top of it. When I train it on the CPU, I get a loss of 0.10 after 50 epochs, but when I train it on the GPU, the loss is 0.15 after 50 epochs. Running more epochs doesn't really lower the loss in either case.
Why does training on the CPU give better performance than training on the GPU?
I have tried changing the random seeds for both the data and the model, and the results are independent of the seeds.
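This is roughly how I fix the seeds before each run (a minimal sketch; the seed value and the function name are just placeholders, and I use the same seed for the CPU run and the GPU run):

```python
import random
import numpy as np
import torch

def set_seed(seed=42):  # 42 is a placeholder value
    # Seed every source of randomness I'm aware of
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```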
I have:
Python 3.6.2
PyTorch 0.3.0
cuDNN 7.0.5
Edit:
I also use PyTorch's weight normalization (torch.nn.utils.weight_norm) on the GRU and on the fully-connected layer.
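For reference, the model looks roughly like this (a minimal sketch; the class name and the input/hidden sizes are placeholders, not my actual configuration):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

class NextStepGRU(nn.Module):
    def __init__(self, input_size=1, hidden_size=64):  # sizes are placeholders
        super(NextStepGRU, self).__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, input_size)
        # weight_norm on the GRU's input-to-hidden and hidden-to-hidden weight matrices
        self.gru = weight_norm(self.gru, name='weight_ih_l0')
        self.gru = weight_norm(self.gru, name='weight_hh_l0')
        # weight_norm on the fully-connected output layer
        self.fc = weight_norm(self.fc, name='weight')

    def forward(self, x, h0=None):
        out, h = self.gru(x, h0)          # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :]), h  # predict the next step from the last output
```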