I have two 1-dimensional PyTorch tensors (of type bfloat16), and I want to compute their inner/dot product.

Should I use torch.dot or torch.inner? I thought it shouldn't really matter which one, but they give me wildly different results. I experimented with some other methods too, and found that some behave like torch.dot and some behave like torch.inner (see comments in code below). Why is this? And which is the right one to use?

import torch
torch.manual_seed(17) # set seed for replication

dtype=torch.bfloat16
# make two 1d tensors 
a = torch.rand([10000], dtype=dtype)
b = torch.rand([10000], dtype=dtype)

x1 = torch.dot(a,b)
# or, equivalently:
# x1 = torch.matmul(a,b)
# x1 = a @ b

x2 = torch.inner(a,b)
# or equivalently: 
# x2 = (a*b).sum(-1)
# x2 = torch.mul(a,b).sum(-1)
# x2 = torch.matmul(a.unsqueeze(0),
#                   b.unsqueeze(-1)).squeeze() 

The results are not equal. They're not even close.

print(f"""{x1 = }\n{x2 = }
{torch.equal(x1, x2) = }
{torch.isclose(x1, x2) = }""")
x1 = tensor(256., dtype=torch.bfloat16)
x2 = tensor(2464., dtype=torch.bfloat16)
torch.equal(x1, x2) = False
torch.isclose(x1, x2) = tensor(False)

However, if I set dtype=torch.float instead of bfloat16, they end up nearly the same (with only a tiny difference, which I suppose comes from floating-point rounding and the order of accumulation).

x1 = tensor(2477.7292)
x2 = tensor(2477.7295)
torch.equal(x1, x2) = False
torch.isclose(x1, x2) = tensor(True)

What is the best way to get the inner product reliably, if the type is/may be bfloat16?


EDIT:

Python 3.10.10 on CPU.

postylem

1 Answer

I did some digging but don't really have a satisfying answer. Unfortunately, it all depends on how the operations are ordered internally, and this can even differ between the BLAS implementations used on CPU and GPU. See also this and this.

But to add a bit of new information: the result of 256 is no coincidence; it's what you get when you sum many numbers smaller than one sequentially in bfloat16.

import torch

a = torch.rand([10000], dtype=torch.bfloat16)
x = torch.tensor(0., dtype=torch.bfloat16)
for a1 in a:
    x += a1  # once x reaches 256, adding anything smaller than 1 no longer changes it
print(x)     # tensor(256., dtype=torch.bfloat16)

This order of operations is very bad for floating-point formats with little mantissa precision, such as bfloat16, and it is presumably what happens internally in torch.dot/torch.matmul.
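As a practical workaround (my own sketch, not an official recommendation), you can upcast to float32, take the dot product, and only cast back at the end, so the accumulation happens with a much larger mantissa:

import torch

torch.manual_seed(17)
a = torch.rand([10000], dtype=torch.bfloat16)
b = torch.rand([10000], dtype=torch.bfloat16)

# accumulate in float32, then (optionally) round the final result back to bfloat16
x = torch.dot(a.float(), b.float()).to(torch.bfloat16)
print(x)

The cast back to bfloat16 is optional; if you can keep the result in float32, you lose nothing by doing so.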

Edit:

And even funnier:

print(torch.tensor(257., dtype=torch.bfloat16))

tensor(256., dtype=torch.bfloat16)
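This is just bfloat16's precision at work: with only 8 significant bits, the gap between representable values above 256 is 2, and 257 sits exactly halfway, so it rounds to the even neighbour, 256. A quick check:

import torch

x = torch.tensor(256., dtype=torch.bfloat16)
print(x + 0.5)  # tensor(256., dtype=torch.bfloat16) -- 256.5 rounds down to 256
print(x + 1.0)  # tensor(256., dtype=torch.bfloat16) -- 257 is a tie and rounds to even (256)
print(x + 2.0)  # tensor(258., dtype=torch.bfloat16) -- 258 is the next representable value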

Sandro
    I'm still unsure what to do though (but thanks very much for the links and information). I suppose doing a dot/inner product of two vectors in bfloat16 is unreliable... but from a practical perspective, if one must do such an operation, is it true that `a.inner(b)` is in general a better choice than `a.dot(b)` (as it seems here), or does this depend on whether you're on CPU or GPU, or some other considerations? – postylem Jul 19 '23 at 17:42