I am currently interested in model quantization, especially post-training quantization of neural network models.
I simply want to convert an existing model (i.e., a TensorFlow model) that uses float32 weights into a quantized model with float16 weights.
Simply changing the data type of every weight (i.e., float32 -> float16) would not work, would it?
Then, how can I manually perform this type of quantization?
Suppose I have a weight matrix that looks like this: [ 0.02894312 -1.8398855 -0.28658497 -1.1626594 -2.3152962 ]
Simply casting every weight in the matrix to the float16 dtype would produce the following matrix: [ 0.02895 -1.84 -0.2866 -1.163 -2.314 ]
which, I suspect, is not how the quantization in off-the-shelf libraries (e.g., TF Lite) actually works...
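For concreteness, the naive cast I mean is just this (a minimal numpy sketch; the values are the ones from the matrix above):

```python
import numpy as np

# Original float32 weights (the matrix from above)
w32 = np.array([0.02894312, -1.8398855, -0.28658497,
                -1.1626594, -2.3152962], dtype=np.float32)

# Naive "quantization": a plain dtype cast to float16
w16 = w32.astype(np.float16)
print(w16)  # -> [ 0.02895 -1.84 -0.2866 -1.163 -2.314 ]
```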
Then, how can I manually perform this type of quantization properly?
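For reference, the off-the-shelf route I'm comparing against looks like this (float16 post-training quantization via the TF Lite converter, based on the TF Lite docs; "path/to/saved_model" is a placeholder for my model's directory):

```python
import tensorflow as tf

# Float16 post-training quantization with the TF Lite converter.
# "path/to/saved_model" is a placeholder for the actual model directory.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
```

What I'd like to understand is what this converter actually does to the weights internally, so that I can reproduce it by hand.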