Inspired by "Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model", I decided to try following the paper's approach. However, I am confused about how to set the offset variable during quantization. Here is what I have so far:
import torch
from scipy.stats import norm

# INPUT: A, an FP32 tensor of shape [1, 4, 1024, 256]

# Quantization
offset = torch.zeros_like(A)   # offset is currently all zeros -- this is the part I am unsure about
scale = 255 / (torch.max(A) - torch.min(A))
A_int8 = (A - offset) * scale  # not yet rounded or clamped to [0, 255]

# Probability Distribution
# keepdim=True so the per-channel mean/std broadcast against the full tensor
P = norm.pdf(A.numpy(),
             torch.mean(A, dim=[2, 3], keepdim=True).numpy(),
             torch.std(A, dim=[2, 3], keepdim=True).numpy())
Q = norm.pdf(A_int8.numpy(),
             torch.mean(A_int8, dim=[2, 3], keepdim=True).numpy(),
             torch.std(A_int8, dim=[2, 3], keepdim=True).numpy())
P = torch.from_numpy(P)
Q = torch.from_numpy(Q)
# KLD
kld = (P * (P / Q).log()).sum()
print(kld)
# After this, I'm going to apply the self-attention operation:
# B_int8 = A_int8.clone()
# AB = A_int8.matmul(B_int8.transpose(-1, -2))
I get a positive KLD value for now, but I'm not sure whether I went about it the right way. Any help or advice is appreciated.
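For reference, below is a minimal sketch of what I understand a standard asymmetric (affine) quantization with a nonzero offset to look like, where the offset is the tensor minimum so the values map onto [0, 255]. The helper names (affine_quantize, affine_dequantize) and the random stand-in tensor are just mine; I'm not sure whether this is what the paper actually means by the offset:

import torch

def affine_quantize(A: torch.Tensor):
    # Asymmetric (affine) quantization to uint8: offset = min(A), so A maps onto [0, 255]
    offset = torch.min(A)
    scale = 255 / (torch.max(A) - torch.min(A))
    A_q = torch.clamp(torch.round((A - offset) * scale), 0, 255).to(torch.uint8)
    return A_q, scale, offset

def affine_dequantize(A_q, scale, offset):
    # Map the uint8 codes back to approximate FP32 values
    return A_q.to(torch.float32) / scale + offset

A = torch.randn(1, 4, 1024, 256)               # random stand-in for the real activation tensor
A_q, scale, offset = affine_quantize(A)
A_hat = affine_dequantize(A_q, scale, offset)
print((A - A_hat).abs().max())                 # worst-case reconstruction error

With this version the quantized codes actually land in the uint8 range, which is what I assumed the offset was for, but I would appreciate confirmation.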