I'm trying to use quantization in a convolutional neural network in order to reduce memory usage, going from the FP32 data type to int16. The problem is that I'm getting poor results, and since this is the first time I've used this kind of representation, I have some doubts about whether my implementation is correct.
First of all, I'm quantizing both the input data and the weights with the following function (uniform quantization):
#define FXP 16        //word length of the fixed-point type (int16_t)
#define FXP_VALUE 10  //number of fractional bits (Q5.10)

int16_t quantize(float a, int fxp){
    int32_t maxVal = (1 << (FXP-1)) - 1;   //largest representable value, 32767
    float scaled = a * (float)(1 << fxp);  //mapping onto the fixed-point scale
    //rounding to nearest (away from zero), done on the float before truncation
    if(a >= 0.0f){
        scaled += 0.5f;
    }else{
        scaled -= 0.5f;
    }
    int32_t value = (int32_t)scaled;
    //clipping (symmetric saturation)
    if(value > maxVal){
        return (int16_t)maxVal;
    }else if(value < -maxVal){
        return (int16_t)(-maxVal);
    }else{
        return (int16_t)value;
    }
}
int16_t value = quantize(test_data[i],10);
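For inspection/debugging, the inverse mapping back to float just divides by the same scale; a minimal sketch (assuming the same number of fractional bits) would be:
float dequantize(int16_t value, int fxp){
    return ((float)value) / (float)(1 << fxp); //inverse of the mapping above
}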
In this case I'm using a Q5.10 format (from the data I have, it seems to be the best fit); for example, 1.5 maps to 1.5 * 2^10 = 1536. Once all the numbers have been converted, the arithmetic inside the network (multiplications and additions/subtractions, used for example in the convolutions) is implemented like this:
for(int k=0; k<output_fea; k++){
    int32_t accumulator = 0;
    for(int l=minimum; l<maximum; l++){
        for(int j=0; j<input_fea; j++){
            //data and weights are int16_t in Q5.10; each product is rescaled back to Q5.10 with rounding
            accumulator += (data[l][j]*weights[k][l][j] + (1 << (FXP_VALUE-1))) >> FXP_VALUE;
        }
    }
    //saturate before going from int32_t to int16_t
    if(accumulator > INT16_MAX){
        accumulator = INT16_MAX;
    }else if(accumulator < INT16_MIN){
        accumulator = INT16_MIN;
    }
    result[i][k] = (int16_t)ReLU(accumulator); //result is int16_t; i comes from the outer loop over outputs (not shown)
}
}
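A variant I'm also considering (just a sketch, I'm not sure it actually helps) is to accumulate the full-precision Q10*Q10 = Q20 products in the 32-bit accumulator and rescale only once per output, so the rounding is applied a single time instead of at every element (at the cost of needing more headroom in the accumulator):
int32_t accumulator = 0;
for(int l=minimum; l<maximum; l++){
    for(int j=0; j<input_fea; j++){
        accumulator += (int32_t)data[l][j] * (int32_t)weights[k][l][j]; //full-precision Q20 product
    }
}
//single rounded rescale from Q20 back to Q5.10, then saturate to int16_t as before
accumulator = (accumulator + (1 << (FXP_VALUE-1))) >> FXP_VALUE;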
Is what I am doing correct? Are there any steps I could take to improve the results and reduce the approximation error?