
I am trying to figure out the rounding differences between numpy/pytorch, gpu/cpu, and float16/float32 numbers, and what I'm finding confuses me.

The basic version is:

import torch

a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()@b.numpy() - a@b)

The result is all zeros as expected, however

print((a.cuda()@b.cuda()).cpu() - a@b)

gets non-zero results. Why is PyTorch float32 matmul executed differently on the GPU and the CPU?

An even more confusing experiment involves float16, as follows:


a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()@b.numpy() - a@b)
print((a.cuda()@b.cuda()).cpu() - a@b)

These two results are both non-zero. Why are float16 numbers handled differently by numpy and torch? I know the CPU can only do float32 operations and that numpy converts float16 to float32 before computing; however, the torch calculation is also executed on the CPU.
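(To make that assumption concrete, one way to check it, shown only as an illustration and not part of my original experiments, is to compare numpy's float16 matmul against an explicit promote-to-float32-then-demote computation:

import numpy as np

a16 = np.random.rand(3, 4).astype(np.float16)
b16 = np.random.rand(4, 5).astype(np.float16)

# float16 matmul as numpy executes it
direct = a16 @ b16
# explicitly promote to float32, multiply, then cast the result back to float16
promoted = (a16.astype(np.float32) @ b16.astype(np.float32)).astype(np.float16)

# does numpy's float16 matmul match the explicit promote/demote path?
print(np.array_equal(direct, promoted))
)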

And guess what: print((a.cuda()@b.cuda()).cpu() - a.numpy()@b.numpy()) gets an all-zero result! This is pure fantasy to me...

The environment is as follows:

  • python: 3.8.5
  • torch: 1.7.0
  • numpy: 1.21.2
  • cuda: 11.1
  • gpu: GeForce RTX 3090

On the advice of some of the commenters, I added the following equality tests

(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
((a.cuda()@b.cuda()).cpu().numpy() - a.numpy()@b.numpy()).any()

each placed directly after the corresponding one of the five print statements above, and the results are:

False
True
True
True
False

And for the last one, I've run it several times and I think I can rule out luck.
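Note that these are exact, element-wise difference checks. A tolerance-based comparison, along the lines a commenter suggests below, would look something like the call here (rtol/atol may need adjusting for the dtype):

torch.allclose((a.cuda()@b.cuda()).cpu(), a@b)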

  • What does "non-zero results" mean? – talonmies Jan 10 '22 at 10:00
  • To my knowledge the only way to have reproducible results with float types is to use strict mode, and I don't know what Python or any of these libraries use. You could try changing your difference-test to an equals-test to rule out the subtraction introducing the error, though I would expect `x-x` to become `0.0` regardless of mode just as you did. – Zyl Jan 10 '22 at 10:05
  • @talonmies it means some element of the returned tensor have non-zero values, for example, the above first print((a.cuda()@b.cuda()).cpu() - a@b) returned: tensor([[-1.6999e-04, 7.8678e-06, -1.6534e-04, -1.2589e-04, -1.5211e-04], [ 3.4809e-05, 1.1599e-04, 9.6798e-05, 2.5213e-05, 1.9252e-05], [-1.6284e-04, 4.0352e-05, -2.1398e-04, -7.8559e-05, -2.4378e-04]]) – Kevin Sheng Jan 10 '22 at 10:34
  • @talonmies while for all-zero results I mean a tensor with 0.0 at each term – Kevin Sheng Jan 10 '22 at 10:53
  • 1
    And what do they represent as relative errors of the product quantities? Are you sure that the zeros are not just representation issues. `np.all_zero` and `np.all_close` are very useful tools, try them and see what you get – talonmies Jan 10 '22 at 10:53
  • @talonmies thanks for your advice, I have edited the post with additional information accordingly – Kevin Sheng Jan 10 '22 at 11:34

1 Answer


The differences are mostly numerical, as mentioned by @talonmies. The CPU and GPU backends and their respective BLAS libraries are implemented differently and use different operations and orders of operations, hence the numerical differences.

One possible cause is sequential accumulation vs. parallel reduction (https://discuss.pytorch.org/t/why-different-results-when-multiplying-in-cpu-than-in-gpu/1356/3), e.g. (((a+b)+c)+d) can have different numerical properties than ((a+b)+(c+d)).
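As a tiny illustration of the ordering effect (a sketch, not taken from the linked thread), the grouping of float32 additions alone already changes the rounded result:

import torch

# float32 addition is not associative: grouping changes the rounded result
vals = torch.tensor([1.0, 1.0, 1e8, -1e8], dtype=torch.float32)
a, b, c, d = vals

sequential = ((a + b) + c) + d   # left-to-right, as a simple sequential loop might do
pairwise = (a + b) + (c + d)     # tree-style reduction, as a parallel kernel might do

print(sequential.item(), pairwise.item())   # 0.0 vs. 2.0 under round-to-nearest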

This question also mentions fused operations (multiply-add), which can cause numerical differences.
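Python does not expose the hardware fused multiply-add directly, but the effect of rounding the product once instead of twice can be simulated with a wider intermediate type (float64 here is only a stand-in for the FMA's internal precision):

import numpy as np

a = np.float32(1 + 2**-12)   # exactly representable in float32
c = np.float32(-1.0)

# Unfused: a*a is rounded to float32 first, then the addition is rounded again.
unfused = a * a + c

# Simulated FMA: keep the product exact in float64 and round only once at the end.
fused = np.float32(np.float64(a) * np.float64(a) + np.float64(c))

print(unfused, fused)   # differ in the low bits, e.g. 0.00048828125 vs. ~0.00048834086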

I did a little bit of testing and found that the GPU's output in float16 mode can be matched if we promote the data type to float32 before the computation and demote it afterwards. This could be caused by internal intermediate casting or by the better numerical stability of fused operations (torch.backends.cudnn.enabled does not matter). This does not explain the float32 case, though.
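For instance, a whole-matrix version of that check looks like this (the element-wise version is in test2 below; whether it passes for larger shapes may also depend on the accumulation order):

import torch

a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)

gpu_result = (a.cuda() @ b.cuda()).cpu()
# promote to float32, multiply, then demote the result back to float16
promoted_result = (a.float() @ b.float()).half()

# does the promoted CPU result reproduce the GPU output bit-for-bit?
print(torch.equal(gpu_result, promoted_result))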

import torch

def test(L, M, N):
    # test an (LxM) @ (MxN) matmul: CPU float16 vs. GPU float16
    for _ in range(5000):
        a = torch.rand(L, M, dtype=torch.float16)
        b = torch.rand(M, N, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        if (cpu_result-gpu_result).any():   # non-zero difference => not bitwise identical
            print(f'({L}x{M}) @ ({M}x{N}) failed')
            return
    else:
        print(f'({L}x{M}) @ ({M}x{N}) passed')


test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)

def test2():
    for _ in range(5000):
        a = torch.rand(1, 2, dtype=torch.float16)
        b = torch.rand(2, 1, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()

        # accumulate the dot product directly in float16
        half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
        # promote to float32, accumulate, then cast back to float16
        convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()

        if ((cpu_result-half_result).any()):
            print('CPU != half')
            return
        if (gpu_result-convert_result).any():
            print('GPU != convert')
            return
    else:
        print('All passed')

test2()

Output:

(1x1) @ (1x1) passed
(1x2) @ (2x1) failed
(4x1) @ (1x4) passed
(4x4) @ (4x4) failed
All passed

You can tell that when the inner dimension is 1, it passes the check (no multiply-add/reduction needed).

hkchengrex