2

I use the following code to generate a quantized tflite model:

import tensorflow as tf

# num_calibration_steps, input and saved_model_dir below are placeholders
# for my actual calibration loop, sample data and SavedModel path.
def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    # Get sample input data as a numpy array in a method of your choosing.
    yield [input]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()

But according to the post-training quantization documentation:

The resulting model will be fully quantized but still take float input and output for convenience.

To compile the tflite model for the Google Coral Edge TPU, I need quantized input and output as well.

In the model, I see that the first layer converts the float input to input_uint8 and the last layer converts output_uint8 to the float output. How do I edit the tflite model to get rid of these first and last float layers?

I know that I could set the input and output type to uint8 during conversion, but that is not compatible with any optimizations. The only option left then is fake quantization, which results in a bad model.
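
(For reference, by "set the input and output type to uint8 during conversion" I mean roughly the legacy TF1-style flow sketched below; the input name, mean/std values and default ranges are placeholders, not my exact settings.)

import tensorflow as tf

# Legacy (TF1 / TOCO-style) conversion that forces uint8 input/output directly.
# Without fake-quant nodes in the graph it falls back on default_ranges_stats,
# i.e. "dummy quantization", which is what degrades the model.
converter = tf.compat.v1.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.inference_type = tf.uint8
converter.inference_input_type = tf.uint8
converter.quantized_input_stats = {"input": (128.0, 127.0)}  # placeholder (mean, std_dev) for the input tensor
converter.default_ranges_stats = (0.0, 6.0)                  # placeholder (min, max) used when real ranges are missing
tflite_quant_model = converter.convert()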

Ivan Kovtun

  • If you wish a fully quantised network (uint8 inputs), then you have to use the tflite converter differently: either through dummy quantisation, or by exporting a network trained with quantisation-aware training (including ranges) and converting that. Post-training quantisation uses fp32 inputs and either dequantises and uses fp32 kernels, or quantises on the fly (reference from the TF page below): "To further improve latency, hybrid operators dynamically quantize activations to 8-bits and perform computations with 8-bit weights and activations." – Konstantinos Monachopoulos Jul 03 '19 at 15:13
  • Actually you are right. Even when using a calibration dataset to capture the input ranges, the extracted tflite still has fp32 inputs and outputs with post-training quantisation. Only with quantisation-aware training and dummy quantisation can you extract a fully quantised network (with u8 input/output). – Konstantinos Monachopoulos Jul 03 '19 at 17:35
  • @KonstantinosMonachopoulos are you sure? It looks like you can do full integer (inputs/outputs included) without quantization-aware training. I think it can be done in a pure post-training scenario; see the accepted answer [here](https://stackoverflow.com/questions/61083603/how-to-make-sure-that-tflite-interpreter-is-only-using-int8-operations) and the documentation [here](https://www.tensorflow.org/lite/performance/post_training_integer_quant). – Corey Cole Sep 18 '20 at 15:26

3 Answers

2

You can avoid the float-to-int8 and int8-to-float "quant/dequant" ops by setting inference_input_type and inference_output_type (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/python/lite.py#L460-L476) to int8.
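
A minimal sketch of that, assuming a TF 2.3+ converter and reusing `saved_model_dir` and the representative dataset from the question (for a uint8 Edge TPU target you would pass `tf.uint8` instead of `tf.int8`):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

# Force full-integer quantization and integer input/output tensors,
# so no float quant/dequant layers are left at the ends of the graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # or tf.uint8 for the Edge TPU compiler
converter.inference_output_type = tf.int8  # or tf.uint8

tflite_quant_model = converter.convert()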

J.L.
  • Thanks a lot. Source of my problems was that I set inference_type to uint8 and not inference_input_type. – Ivan Kovtun Jul 08 '19 at 17:14
  • In my case, my Keras model only has an input layer of `uint8`, but is **not** quantized (e.g. `float32`). This was to ensure RGB files are imported quickly without typecasts on the CPU. However, I get the error `tensorflow/lite/toco/tooling_util.cc:2258] Check failed: array.data_type == array.final_data_type Array "input_1" has mis-matching actual and final data types (data_type=uint8, final_data_type=float)`... which seems to suggest that unless my model is fully quantized, it won't accept `uint8` as an input dtype. – Mateen Ulhaq Sep 17 '19 at 14:15
2

This:

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    # Model has only one input so each data point has one element.
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

tflite_model_quant = converter.convert()

generates a quantized model that still has Float32 inputs and outputs. This:

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model_quant = converter.convert()

generates a fully quantized model with UINT8 inputs and outputs.

You can make sure this is the case by:

interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
input_type = interpreter.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter.get_output_details()[0]['dtype']
print('output: ', output_type)

which returns:

input:  <class 'numpy.uint8'>
output:  <class 'numpy.uint8'>

if you went for full UINT8 quantization. You can also double-check this by inspecting your model visually using netron.
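
Beyond the dtype check, a short sketch like the one below (reusing `tflite_model_quant` and `train_images` from above; the sample slicing and shapes are placeholders) also shows the input/output quantization parameters and how to feed the uint8 model at inference time:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# The quantization parameters tell you how float values map to uint8.
scale, zero_point = input_details['quantization']
print('input quantization: scale=%s zero_point=%s' % (scale, zero_point))

# Quantize a float sample (placeholder: one image from train_images) before feeding it.
float_sample = train_images[:1].astype(np.float32)
quantized_sample = np.clip(float_sample / scale + zero_point, 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details['index'], quantized_sample)
interpreter.invoke()
raw_output = interpreter.get_tensor(output_details['index'])

# Dequantize the uint8 output back to floats if you need them.
out_scale, out_zero_point = output_details['quantization']
dequantized_output = (raw_output.astype(np.float32) - out_zero_point) * out_scale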

Mike B
1

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# The three lines below quantize the input and output as well:
# restrict the converter to int8 ops and request uint8 input/output tensors.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Suraj Rao Mar 04 '21 at 06:44
  • I'd very much love to see an explanation as well, to better understand how this all works. – nickc Feb 16 '23 at 02:08