My code

I am working on a simple program that uses this function in an academic project:

void calculateDistanceMatrix(const float data[M][N],
                             float distance[M][N]) {
    float sum = 0.0;
    for(int i = 0; i < M; i++) {
        for(int j = i+1; j < M; j++) {
            for(int k = 0; k < N; k++) {
                sum += (data[i][k] - data[j][k]) *
                       (data[i][k] - data[j][k]);
            }
            distance[i][j] = sum;
            distance[j][i] = sum;
            distance[i][i] = 0.0;
            sum = 0.0;
        }
    }
}

My target architecture

My code should do nothing more than this simple matrix operation on 'data', filling the 'distance' matrix with the results. In my academic project, however, I am interested in how the compiler optimizes these vector operations for the ARM architecture I am working with. The compilation command line contains the following:

arm-none-eabi-gcc <flags> <my_sources> -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard <more_flags>

My program is intended to run on an embedded Xilinx Zynq-7000 device, whose architecture includes the NEON instruction set for vector operations (described in this nice presentation).

My issue

I have to track the performance of the execution of the vector operations in the 'calculateDistanceMatrix' function with and without compiler optimizations. I notice the assembly output includes the shared NEON and VFP instructions for the vector load and store operations (detailed in ARM's Assembler Reference for Version 5.0):

ecf37a01    vldmia  r3!, {s15}
ecf26a01    vldmia  r2!, {s13}
e1530000    cmp r3, r0
ee777ae6    vsub.f32    s15, s15, s13
ee077aa7    vmla.f32    s14, s15, s15
1afffff9    bne 68 <calculateDistanceMatrix+0x48>
eca17a01    vstmia  r1!, {s14}

I couldn't find a way to compile this code such that these optimized instructions are not used.

Do you know of any compilation configuration or code trick that could avoid these instructions? I would appreciate any help on this issue.

  • https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html - something with `+nofp` ? – Eugene Sh. Jun 22 '18 at 20:50
  • I had not seen these '-march' options, thanks @EugeneSh. ! I could not make them work, though... After editing my compilation command line somehow both the "-march=...+nofp" and "-mcpu=...+nofp" failed with other library in my project. I guess I would have to spend some time later to isolate it and retry this compilation. My other bet was the pragma "#pragma GCC target ("armv7+nofp")", but it made no difference in the assembly either. – augustomafra Jun 22 '18 at 22:03
  • Maybe, if you had done some research before asking the question, you wouldn't have mistaken scalar floating-point operations for vector operations. Try reading the architecture reference manual entry for the instructions next time. – EOF Jun 23 '18 at 10:55
  • You should use the same arch/cpu/fp options for all objects as it will change the ABI – M.M Jul 11 '18 at 00:34
  • You could try different optimization levels, e.g. `-Os`, `-O1`. I believe it is normal to use `-O0 -g` for debugging – M.M Jul 11 '18 at 00:35

2 Answers

The instructions you quoted are not vector operations. Take vsub.f32 s15, s15, s13: this is a simple scalar 32-bit floating-point subtraction. You can tell by the single-precision S registers and the .f32 suffix in the instruction; a NEON vector operation would use D or Q registers instead.
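
To make the scalar/vector distinction concrete, here is a minimal sketch (not the asker's code) of the inner squared-distance loop written twice: once as plain scalar C, and once with NEON intrinsics from `arm_neon.h`. The function names `squared_distance_scalar` and `squared_distance_neon` are made up for illustration, and the NEON path is guarded so the file still builds on targets without NEON. As far as I know from the GCC ARM options documentation, on ARMv7 GCC will not auto-vectorize float loops for NEON unless something like `-mfpu=neon` together with `-funsafe-math-optimizations` is given, which would explain seeing only S-register code with `-mfpu=vfpv3`.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar version: one float per iteration. With VFP this is expected to
 * become S-register code like the vsub.f32/vmla.f32 listing in the question. */
float squared_distance_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t k = 0; k < n; k++) {
        float d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

/* Hypothetical NEON version: four floats per iteration in Q registers.
 * Falls back to the scalar tail loop when NEON is not available. */
float squared_distance_neon(const float *a, const float *b, size_t n) {
    size_t k = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; k + 4 <= n; k += 4) {
        float32x4_t d = vsubq_f32(vld1q_f32(a + k), vld1q_f32(b + k));
        acc = vmlaq_f32(acc, d, d);          /* acc += d * d, lane-wise */
    }
    /* Horizontal add of the four lanes into one float. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif
    /* Remaining elements (or all of them without NEON). */
    for (; k < n; k++) {
        float d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}
```

In the disassembly, the vector version should show up as instructions operating on `q` (or `d`) registers, while the scalar one keeps using `s` registers.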

Kyrill
  • You are absolutely right. After revisiting this issue, I found out that the actual problem is that my environment was set to always build in debug mode. After fixing this and building for production, I could find the right optimized code, which uses the VLDM and VSTM instructions – augustomafra Oct 05 '18 at 22:10

I revisited this issue and found out that my environment was set to build in debug mode, thus no optimization was really taking place.

The actual optimized code uses the VLDM and VSTM instructions. They are not generated, however, when I add the pragma

#pragma GCC optimize ("O0")

in my source file.
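
For comparing optimized and unoptimized builds of just this one function, a possible variant is to scope the pragma with GCC's `push_options`/`pop_options` so the rest of the file keeps the command-line optimization level. This is only a sketch: the values of `M` and `N` are placeholders, and `distance` is declared M×M here (an assumption; the question declares it M×N), since it is indexed by two row indices.

```c
#define M 3   /* placeholder sizes, not taken from the question */
#define N 2

/* Only GCC understands these pragmas; other compilers skip them. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("O0")
#endif

/* Squared Euclidean distance between every pair of rows of data. */
void calculateDistanceMatrix(const float data[M][N],
                             float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        distance[i][i] = 0.0f;
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                float d = data[i][k] - data[j][k];
                sum += d * d;
            }
            distance[i][j] = sum;   /* the matrix is symmetric */
            distance[j][i] = sum;
        }
    }
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options   /* restore the command-line optimization level */
#endif
```

Building the same file once with and once without the pragma lines, and disassembling both objects, should give the two versions to time against each other.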