I have investigated quantization in TensorFlow a bit and applied it to convert float operations into quantized operations.
In my case the input to the net is still float. The input gets quantized right before it enters the quantized operations. TensorFlow prefers to keep float values as long as possible in order to stay compatible with float operations.
This is also the reason why TensorFlow keeps the min and max float range after the float input has been quantized into the 8-bit integer format.
The min and max float values produced by that quantization step are also inputs to the quantized operations.
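To make that concrete, here is a small NumPy sketch of the mapping (the helper name is mine, and it is a simplified version of what the TensorFlow quantize step does): an 8-bit tensor only has meaning together with the float min/max it was derived from, which is why the range travels alongside the quantized data.

```python
import numpy as np

def quantize_to_uint8(x):
    """Map a float array to uint8 plus the (min, max) float range.

    Simplified sketch: the 8-bit values are only meaningful together
    with the float range they were derived from.
    """
    range_min = float(x.min())
    range_max = float(x.max())
    if range_max == range_min:          # avoid a zero-width range
        range_max = range_min + 1e-6
    scale = 255.0 / (range_max - range_min)
    q = np.round((x - range_min) * scale).astype(np.uint8)
    return q, range_min, range_max

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
print(quantize_to_uint8(x))   # e.g. (array([0, 85, 128, 255], dtype=uint8), -1.0, 2.0)
```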
In your case, the Quant_conv2d operation performs a convolution with the following inputs (see the sketch after this list):
- unsigned 8-bit data from quantization
- unsigned 8-bit quantized kernel values
The outputs are:
- the result as 32-bit values
- the new min and max range as float values
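Depending on your TensorFlow version, the quantized convolution is exposed as a raw op. The sketch below shows the shape of the interface as I understand it (8-bit tensors plus their float ranges go in, a 32-bit result plus a new float range comes out); treat the exact call and availability of `tf.raw_ops.QuantizedConv2D` on your build as an assumption rather than a guaranteed API:

```python
import tensorflow as tf

# Float input and kernel (NHWC input, HWIO kernel).
x = tf.random.uniform([1, 8, 8, 3], minval=-1.0, maxval=1.0)
w = tf.random.uniform([3, 3, 3, 16], minval=-0.5, maxval=0.5)

# Quantize both to unsigned 8 bit; each comes with its own float range.
qx, x_min, x_max = tf.quantization.quantize(x, -1.0, 1.0, tf.quint8)
qw, w_min, w_max = tf.quantization.quantize(w, -0.5, 0.5, tf.quint8)

# Quantized convolution: 8-bit data in, 32-bit accumulators out,
# plus the new min/max float range of the result.
y, y_min, y_max = tf.raw_ops.QuantizedConv2D(
    input=qx, filter=qw,
    min_input=x_min, max_input=x_max,
    min_filter=w_min, max_filter=w_max,
    strides=[1, 1, 1, 1], padding="SAME",
    out_type=tf.qint32)

print(y.dtype, float(y_min), float(y_max))   # qint32 plus the new float range
```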
The new float range is calculated from the range of the kernel values and the range of the input using the QuantizationRangeForMultiplication function defined in:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/quantization_utils.h
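My reading of that function, paraphrased in Python: the float size of one quantization step in the 32-bit output is the product of the step sizes of the two 8-bit inputs, and the output range is that step size stretched over the full int32 range. Treat this as a sketch of the idea, not the exact library code:

```python
import numpy as np

def float_for_one_quantized_level(range_min, range_max, qtype=np.uint8):
    """Float value represented by one quantization step of qtype."""
    info = np.iinfo(qtype)
    return (range_max - range_min) / (int(info.max) - int(info.min))

def quantization_range_for_multiplication(min_a, max_a, min_b, max_b,
                                          out_type=np.int32):
    """Output float range of a product of two quantized tensors."""
    step_c = (float_for_one_quantized_level(min_a, max_a) *
              float_for_one_quantized_level(min_b, max_b))
    info = np.iinfo(out_type)
    return step_c * info.min, step_c * info.max

# Example: input range [-1, 1], kernel range [-0.5, 0.5]
print(quantization_range_for_multiplication(-1.0, 1.0, -0.5, 0.5))
```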
As stated, the output is 32-bit, together with min and max float values that map the quantized numbers back to their absolute float values and allow the quantized format to be converted back to float if needed.
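The back-conversion looks roughly like this (again a simplified sketch, not the exact library routine): the quantized value is treated as a position between the type's lowest and highest representable integer, and that position is mapped linearly onto [min, max].

```python
import numpy as np

def dequantize(q, range_min, range_max, qtype=np.int32):
    """Map quantized values back to float using the stored range."""
    info = np.iinfo(qtype)
    scale = (range_max - range_min) / (float(info.max) - float(info.min))
    return (q.astype(np.float64) - info.min) * scale + range_min

# 32-bit conv results together with their computed range can be turned
# back into approximate float activations like this:
acc = np.array([0, 1_000_000, -2_000_000], dtype=np.int32)
out_min, out_max = -66000.0, 66000.0   # example range, e.g. from the sketch above
print(dequantize(acc, out_min, out_max))
```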
Hope this helps to understand TensorFlow's quantization algorithms.