I'm trying to quantize FP32 inputs to INT8 before a matrix multiplication, then requantize the accumulated INT32 output back to INT8, and finally dequantize it to FP32. I suspect I've mixed a couple of things up somewhere in the process, but I'm stuck on spotting the trouble spots.
data flow [Affine Quantization]:
input(fp32) -> quant(int8) ____\ matmul(int32) -> requant(int8) -> deq(fp32)
input(fp32) -> quant(int8) ----/
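For reference, the affine mapping I believe I should be following is q = round(x / scale) + zero_point with scale = (max - min) / 255 for an unsigned 8-bit range, and x ≈ scale * (q - zero_point) for dequantization. (In my pseudo code below I compute the reciprocal, 255 / (max - min), and multiply instead of divide, which should be equivalent.) A minimal sketch of the two helpers, with names that are mine and not from any PyTorch API:

import torch

def affine_quantize(x: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # q = round(x / scale) + zero_point, clamped to the unsigned 8-bit range [0, 255]
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, 0, 255).to(torch.uint8)

def affine_dequantize(q: torch.Tensor, scale: float, zero_point: float) -> torch.Tensor:
    # inverse mapping: x ≈ scale * (q - zero_point)
    return scale * (q.to(torch.float32) - zero_point)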
My Pseudo Code
INPUT (FP32):
Embedded-word tensors A and B, each of shape [1, 4, 1024, 256] (B is the same tensor as A)
EXPECTED OUTPUT (FP32):
Embedded-word tensor AB of shape [1, 4, 1024, 1024] (the result of multiplying A by B transposed, i.e. by itself)
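Just to make the target concrete, the plain FP32 reference I am trying to approximate looks like this (random data as a stand-in for the real embeddings):

import torch

# stand-in for the real embedded words; shapes match the description above
A = torch.randn(1, 4, 1024, 256)
B = A  # B is the same tensor as A

# FP32 reference: multiply A by B transposed on the last two dimensions
AB_reference = A.matmul(B.transpose(-1, -2))
print(AB_reference.shape)  # torch.Size([1, 4, 1024, 1024])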
# convert A and B from FP32 into INT8
A_zero_offset = torch.zeros_like(A)          # offset chosen to be zero **[Question 1]**
scale = 255 / (torch.max(A) - torch.min(B))  # 2^8 - 1 = 255
A_quantized = torch.round((A - A_zero_offset) * scale)

# likewise for B (B is the same tensor as A)
B_quantized = A_quantized

AB = A_quantized.matmul(B_quantized.transpose(-1, -2))
# the accumulated datatype is (conceptually) INT32 here

AB_offset = torch.full_like(AB, torch.min(AB).item())  # offset chosen to be AB's min element **[Question 1]**
scale_AB = 255 / (torch.max(AB) - torch.min(AB))        # **[Question 2]**
AB_requantized = torch.round((AB - AB_offset) * scale_AB)

# dequantize AB (currently INT8) back into FP32
# **[Question 3]**
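Once the missing dequantization step is filled in (call its result AB_dequantized; that name is just a placeholder for whatever Question 3 produces), my plan is to sanity-check the whole pipeline against the FP32 reference like this:

# AB_reference is the plain FP32 matmul from above;
# AB_dequantized is the placeholder result of the missing dequantization step
error = (AB_dequantized - AB_reference).abs()
print("max abs error :", error.max().item())
print("mean abs error:", error.mean().item())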
[Question 1] : Does it make sense to set A's offset to zero but AB's offset to min(AB)?
[Question 2] : Is "max(AB) - min(AB)" the right quantity for computing the requantization scale, or should I use some other method?
[Question 3] : Finally, which scale and offset do I need, and what operation should I apply, to dequantize the result back into FP32?
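For Question 3, my current guess (which may well be wrong, hence the question) is to simply invert the requantization step. But that only seems to bring me back to the values AB had before requantization, not to the original FP32 scale, so I suspect the input scale has to come in somewhere:

# my guess: invert the requantization step above
# this recovers (approximately) the pre-requantization AB, not the FP32 result
AB_dequantized = AB_requantized / scale_AB + AB_offset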