I am trying to quantize the weights and biases of my neural network to a 16-bit integer format. The reason for this is to use these arrays in CCS to program the network on an MCU. I followed the post-training quantization workflow in TensorFlow Lite and got results for a conversion to the uint8 format, but I am not sure how to achieve the same for a 16-bit format. My code for the uint8 conversion was as follows:
import numpy as np
import tensorflow as tf

def representative_data_gen():
    data = np.array(x_train, dtype=np.float32)
    for input_value in data:
        yield [input_value]
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Set the optimization mode
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Pass representative dataset to the converter
converter.representative_dataset = representative_data_gen
# Restricting supported target op specification to INT8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
# Convert and Save the model
tflite_model = converter.convert()
open("clap_model.tflite", "wb").write(tflite_model)
My x_train array contains values in the float32 format. As I read through the approaches on the TensorFlow Lite page, I did see a 16x8 mode (16-bit activations), but the weights still remain in an 8-bit format in that scenario.
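For reference, my understanding is that the 16x8 mode would look something like the sketch below, reusing the same model and representative_data_gen as above; the only real change from my uint8 code is the target op set (and it still leaves the weights at 8 bits, which is exactly my problem):

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# 16-bit activations with 8-bit weights (experimental 16x8 quantization mode)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
open("clap_model_16x8.tflite", "wb").write(tflite_16x8_model)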
If there is any other way to convert these floating-point values, or even the obtained 8-bit integers, to a 16-bit integer format, that would also be extremely helpful. The only approach I can think of is manually quantizing the floating-point values to 16-bit integers, but my guess is that would be somewhat tedious, since I would have to extract the weights and biases and then pass each of them through such a quantization function.
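Roughly what I have in mind for the manual route is something like the sketch below. It assumes a simple symmetric per-tensor scheme, and quantize_int16 is just a hypothetical helper name I made up, not anything from TensorFlow:

def quantize_int16(w):
    # Symmetric per-tensor quantization: map the largest magnitude to 32767
    # (assumes the tensor is not all zeros)
    scale = np.max(np.abs(w)) / 32767.0
    q = np.round(w / scale).astype(np.int16)
    return q, scale

# Quantize every weight/bias tensor of the trained Keras model;
# the scale per tensor would also need to be stored for use on the MCU
int16_tensors = [quantize_int16(w) for w in model.get_weights()]

Is this the sensible way to do it, or is there a built-in path I am missing?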