Inspired by "Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model", I decided to try following the paper's approach. However, I am confused about how to set the offset variable during quantization. Here is what I have so far:
import torch
from scipy.stats import norm

# INPUT: A, an FP32 tensor of shape [1, 4, 1024, 256]

# Quantization
offset = torch.zeros_like(A)   # offset is currently all zeros -- this is the part I am unsure about
scale = 255 / (torch.max(A) - torch.min(A))
A_int8 = (A - offset) * scale  # not yet rounded or clamped to [0, 255]

# Probability Distribution
# keepdim=True so the per-channel mean/std broadcast against the full tensor
P = norm.pdf(A.numpy(),
             torch.mean(A, dim=[2, 3], keepdim=True).numpy(),
             torch.std(A, dim=[2, 3], keepdim=True).numpy())
Q = norm.pdf(A_int8.numpy(),
             torch.mean(A_int8, dim=[2, 3], keepdim=True).numpy(),
             torch.std(A_int8, dim=[2, 3], keepdim=True).numpy())
P = torch.from_numpy(P)
Q = torch.from_numpy(Q)
# KLD
kld = (P * (P / Q).log()).sum()
print(kld)
# After this, I'm going to apply the self-attention operation:
# B_int8 = A_int8.clone()
# AB = A_int8.matmul(B_int8.transpose(-1, -2))
I get a positive KLD value for now, but I'm not sure whether I went about it the right way. Any help or advice is appreciated.
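For reference, below is a minimal sketch of what I understand a standard asymmetric (affine) quantization with a nonzero offset to look like, where the offset is the tensor minimum so the values map onto [0, 255]. The helper names (affine_quantize, affine_dequantize) and the random stand-in tensor are just mine; I'm not sure whether this is what the paper actually means by the offset:

import torch

def affine_quantize(A: torch.Tensor):
    # Asymmetric (affine) quantization to uint8: offset = min(A), so A maps onto [0, 255]
    offset = torch.min(A)
    scale = 255 / (torch.max(A) - torch.min(A))
    A_q = torch.clamp(torch.round((A - offset) * scale), 0, 255).to(torch.uint8)
    return A_q, scale, offset

def affine_dequantize(A_q, scale, offset):
    # Map the uint8 codes back to approximate FP32 values
    return A_q.to(torch.float32) / scale + offset

A = torch.randn(1, 4, 1024, 256)               # random stand-in for the real activation tensor
A_q, scale, offset = affine_quantize(A)
A_hat = affine_dequantize(A_q, scale, offset)
print((A - A_hat).abs().max())                 # worst-case reconstruction error

With this version the quantized codes actually land in the uint8 range, which is what I assumed the offset was for, but I would appreciate confirmation.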