I'm trying to quantize FP32 inputs to INT8 before a matrix multiplication, then requantize the accumulated INT32 output back to INT8, and finally dequantize it to FP32. I suspect I've mixed a couple of things up somewhere in the process, but I'm stuck on spotting the trouble spots.
data flow [Affine Quantization]:
input(fp32) -> quant(int8) ____\ matmul(int32) -> requant(int8) -> deq(fp32)
input(fp32) -> quant(int8) ----/
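For reference, the affine mapping I believe I should be following is q = round(x / scale) + zero_point with scale = (max - min) / 255 for an unsigned 8-bit range, and x ≈ scale * (q - zero_point) for dequantization. (In my pseudo code below I compute the reciprocal, 255 / (max - min), and multiply instead of divide, which should be equivalent.) A minimal sketch of the two helpers, with names that are mine and not from any PyTorch API:

import torch

def affine_quantize(x: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # q = round(x / scale) + zero_point, clamped to the unsigned 8-bit range [0, 255]
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, 0, 255).to(torch.uint8)

def affine_dequantize(q: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # inverse mapping: x ≈ scale * (q - zero_point)
    return scale * (q.to(torch.float32) - zero_point)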
My Pseudo Code
INPUT (FP32):
Embedded-word tensors A and B, each of shape [1, 4, 1024, 256] (B is the same tensor as A)
EXPECTED OUTPUT (FP32):
Embedded-word tensor AB of shape [1, 4, 1024, 1024] (the result of multiplying A by B transposed, i.e. by itself)
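Just to make the target concrete, the plain FP32 reference I am trying to approximate looks like this (random data as a stand-in for the real embeddings):

import torch

# stand-in for the real embedded words; shapes match the description above
A = torch.randn(1, 4, 1024, 256)
B = A  # B is the same tensor as A

# FP32 reference: multiply A by B transposed on the last two dimensions
AB_reference = A.matmul(B.transpose(-1, -2))
print(AB_reference.shape)  # torch.Size([1, 4, 1024, 1024])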
# convert A and B from FP32 into INT8
A_zero_offset = torch.zeros_like(A)          # offset chosen to be zero **[Question 1]**
scale = 255 / (torch.max(A) - torch.min(B))  # 2^8 - 1 = 255
A_quantized = torch.round((A - A_zero_offset) * scale)

# likewise for B (B is the same tensor as A)
B_quantized = A_quantized

AB = A_quantized.matmul(B_quantized.transpose(-1, -2))
# the accumulated datatype is (conceptually) INT32 here

AB_offset = torch.full_like(AB, torch.min(AB).item())  # offset chosen to be AB's min element **[Question 1]**
scale_AB = 255 / (torch.max(AB) - torch.min(AB))        # **[Question 2]**
AB_requantized = torch.round((AB - AB_offset) * scale_AB)

# dequantize AB (currently INT8) back into FP32
# **[Question 3]**
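Once the missing dequantization step is filled in (call its result AB_dequantized; that name is just a placeholder for whatever Question 3 produces), my plan is to sanity-check the whole pipeline against the FP32 reference like this:

# AB_reference is the plain FP32 matmul from above;
# AB_dequantized is the placeholder result of the missing dequantization step
error = (AB_dequantized - AB_reference).abs()
print("max abs error :", error.max().item())
print("mean abs error:", error.mean().item())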
[Question 1] : Does it make sense to set A's offset to zero but AB's offset to min(AB)?
[Question 2] : Is "max(AB) - min(AB)" the right quantity for computing the requantization scale, or should I use some other method?
[Question 3] : Finally, which scale and offset do I need, and what operation should I apply, to dequantize the result back into FP32?
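For Question 3, my current guess (which may well be wrong, hence the question) is to simply invert the requantization step. But that only seems to bring me back to the values AB had before requantization, not to the original FP32 scale, so I suspect the input scale has to come in somewhere:

# my guess: invert the requantization step above
# this recovers (approximately) the pre-requantization AB, not the FP32 result
AB_dequantized = AB_requantized / scale_AB + AB_offset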