
I am new to TensorFlow and quantization, and I am trying to implement a quantized matmul operation for two int8 inputs. I am curious about the math behind the operation. I see that TensorFlow has implemented it only for uint8 inputs, and I would like to know how to use that for signed int8 matmul/conv2D.

More precisely I would like to know how to get the float output range for the matmul/conv2D operation.

Any help would be highly appreciated.

Abhinav George

1 Answer


I have investigated quantization in TensorFlow a bit and applied it to convert float operations into quantized operations.

In my case the input to the net is still float. The input gets quantized right before entering the quantized operations. TensorFlow prefers to keep values in float as long as possible in order to stay compatible with float operations. This is also why TensorFlow keeps the min and max float ranges after the float input is quantized into 8-bit integer format. The min and max float values resulting from quantization are also inputs to the quantized operations.
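The quantization step described above can be sketched roughly as follows. This is a minimal illustration of the common affine min/max scheme, not a copy of TensorFlow's actual `QuantizeV2` kernel, whose rounding details may differ:

```python
import numpy as np

def quantize_to_uint8(x, min_range, max_range):
    # Map floats in [min_range, max_range] onto the 256 uint8 levels.
    # The min/max floats are kept alongside the quantized tensor, as
    # described above, so later ops can recover the real values.
    scale = (max_range - min_range) / 255.0
    q = np.round((x - min_range) / scale)
    return np.clip(q, 0, 255).astype(np.uint8), min_range, max_range

q, mn, mx = quantize_to_uint8(np.array([-1.0, 0.0, 1.0]), -1.0, 1.0)
# q is [0, 128, 255]; mn and mx travel with it through the graph.
```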

In your case, the QuantizedConv2D operation performs a convolution with the inputs:

  • unsigned 8-bit data from quantization
  • unsigned 8-bit quantized kernel values

The outputs are:

  • the result as 32-bit integers
  • the new min and max range as float values

The new float ranges are calculated from the range of the kernel values and the range of the input using the QuantizationRangeForMultiplication function defined in:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/quantization_utils.h
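The range math in that header can be paraphrased as follows. The helper names mirror the functions in `quantization_utils.h`, but the Python below is an illustrative sketch of the idea, not the actual implementation:

```python
import numpy as np

def float_for_one_quant_level(min_r, max_r, lowest, highest):
    # Size of one quantized step for a tensor whose float range is
    # [min_r, max_r], stored in an integer type spanning [lowest, highest].
    return (max_r - min_r) / (highest - lowest)

def quantization_range_for_multiplication(min_a, max_a, min_b, max_b):
    # One step of the product equals the product of one step from each
    # operand; the 32-bit output is assumed to span the full int32 type.
    a_level = float_for_one_quant_level(min_a, max_a, 0, 255)  # uint8 input
    b_level = float_for_one_quant_level(min_b, max_b, 0, 255)  # uint8 kernel
    c_level = a_level * b_level
    lowest, highest = np.iinfo(np.int32).min, np.iinfo(np.int32).max
    return c_level * lowest, c_level * highest

min_c, max_c = quantization_range_for_multiplication(0.0, 255.0, 0.0, 255.0)
```

Note for the int8 case you asked about: with int8 the type spans [-128, 127], so `highest - lowest` is still 255 and the step-size formula is unchanged; what moves is the zero point, i.e. which integer level corresponds to which float value within [min, max].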

As stated, the output is 32-bit, accompanied by min and max float values that map it back to absolute values and possibly convert the quantized format back to float.
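That mapping back to float can be sketched like this. It follows the generic affine dequantization idea (as in `QuantizedToFloat` in the same header), again as an illustration rather than the exact TensorFlow code:

```python
import numpy as np

def dequantize(q, min_range, max_range, lowest, highest):
    # Recover approximate float values from a quantized tensor using
    # the min/max range metadata carried alongside it.
    scale = (max_range - min_range) / (float(highest) - float(lowest))
    return min_range + (q.astype(np.float64) - lowest) * scale

i32 = np.iinfo(np.int32)
x = dequantize(np.array([i32.min, i32.max]), -10.0, 10.0, i32.min, i32.max)
# x is approximately [-10.0, 10.0]
```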

Hope this helps you understand TensorFlow's quantization algorithms.

  • Hello William, thank you very much for sharing your views. I did go through how TensorFlow implements QuantizationRangeForMultiplication and noticed that it works for uint8 inputs, but I want the same for int8 inputs, and I am not sure of the math behind it. I would be really happy if you could help me understand the math needed to make QuantizationRangeForMultiplication work for int8 inputs as well. – Abhinav George Nov 01 '18 at 16:04