I'm trying to use quantization in a convolutional neural network in order to reduce memory usage, going from the FP32 data type to int16. The problem is that I'm getting poor results, and since this is the first time I've used this kind of representation, I have some doubts about whether my implementation is correct.
First of all, I'm quantizing both the input data and the weights with the following function (uniform quantization):
#define FXP 16        //word length of the fixed-point type (int16_t)
#define FXP_VALUE 10  //number of fractional bits (Q5.10)

int16_t quantize(float a, int fxp){
    int32_t maxVal = (1 << (FXP-1)) - 1;   //largest representable value, 32767
    float scaled = a * (float)(1 << fxp);  //mapping onto the fixed-point scale
    //rounding to nearest (away from zero), done on the float before truncation
    if(a >= 0.0f){
        scaled += 0.5f;
    }else{
        scaled -= 0.5f;
    }
    int32_t value = (int32_t)scaled;
    //clipping (symmetric saturation)
    if(value > maxVal){
        return (int16_t)maxVal;
    }else if(value < -maxVal){
        return (int16_t)(-maxVal);
    }else{
        return (int16_t)value;
    }
}
int16_t value = quantize(test_data[i],10);
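For inspection/debugging, the inverse mapping back to float just divides by the same scale; a minimal sketch (assuming the same number of fractional bits) would be:
float dequantize(int16_t value, int fxp){
    return ((float)value) / (float)(1 << fxp); //inverse of the mapping above
}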
In this case I'm using a Q5.10 format (from the data I have, it seems to be the best fit); for example, 1.5 maps to 1.5 * 2^10 = 1536. Once all the numbers have been converted, the arithmetic inside the network (multiplications and additions/subtractions, used for example in the convolutions) is implemented like this:
for(int k=0; k<output_fea; k++){
    int32_t accumulator = 0;
    for(int l=minimum; l<maximum; l++){
        for(int j=0; j<input_fea; j++){
            //data and weights are int16_t in Q5.10; each product is rescaled back to Q5.10 with rounding
            accumulator += (data[l][j]*weights[k][l][j] + (1 << (FXP_VALUE-1))) >> FXP_VALUE;
        }
    }
    //saturate before going from int32_t to int16_t
    if(accumulator > INT16_MAX){
        accumulator = INT16_MAX;
    }else if(accumulator < INT16_MIN){
        accumulator = INT16_MIN;
    }
    result[i][k] = (int16_t)ReLU(accumulator); //result is int16_t; i comes from the outer loop over outputs (not shown)
}
}
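A variant I'm also considering (just a sketch, I'm not sure it actually helps) is to accumulate the full-precision Q10*Q10 = Q20 products in the 32-bit accumulator and rescale only once per output, so the rounding is applied a single time instead of at every element (at the cost of needing more headroom in the accumulator):
int32_t accumulator = 0;
for(int l=minimum; l<maximum; l++){
    for(int j=0; j<input_fea; j++){
        accumulator += (int32_t)data[l][j] * (int32_t)weights[k][l][j]; //full-precision Q20 product
    }
}
//single rounded rescale from Q20 back to Q5.10, then saturate to int16_t as before
accumulator = (accumulator + (1 << (FXP_VALUE-1))) >> FXP_VALUE;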
Is what I am doing correct? Are there any steps I could take to improve the results and reduce the approximation error?