I am trying to optimize raytracer code on a BeagleBoard, and for that I am using the NEON coprocessor. A matrix multiplication function is called many times, so I have written it in inline assembly. However, for some reason the results are not accurate. Here is my code:

void VecMatMult(float Vt[4], float M[4][4], float V[4])
{

   __asm__ volatile(

    "vldmia %1, {q1-q4} \n\t" // Load the Matrix in the quad registers
    "vldmia %2, {q5} \n\t" //Load the Vector
    "vmul.f32 q0, q1, d10[0] \n\t" //Calculate the matrix product
    "vmla.f32 q0, q2, d10[1] \n\t"
    "vmla.f32 q0, q3, d11[0] \n\t"
    "vmla.f32 q0, q4, d11[1] \n\t"
    "vstmia %0, {q0} \n\t" //Store the output
    :
    :"r" (Vt), "r" (M), "r" (V)
    :"q0", "q1", "q2", "q3", "q4", "q5"
    );

}

The funny thing is when I call this code in a separate program to test if it works, the results are perfect. However, when it's called in my main program several times, the results are not correct. Any help will be appreciated as I am literally clueless at the moment.

Paul R
fussy
  • You probably have some house-keeping issues - I suggest re-coding this, at least temporarily, using intrinsics, and let the compiler take care of the house-keeping. – Paul R May 28 '13 at 17:49
  • You should compare the generated code between the two versions. But most probably this occurs because you don't list "memory" in your clobber list, so the compiler is allowed to skip reloading your calculated value from memory. – Nico Erfurth May 28 '13 at 18:13
  • I tried putting "memory" in the clobber list. It didn't work. – fussy May 29 '13 at 00:27
  • You'll have to post some of the generated code (including the code surrounding this function). – Nico Erfurth May 29 '13 at 07:31
  • Are they inaccurate or completely off track? You should also post the C code for the same function, IMHO, so people can understand what you are really trying to do. – auselen May 29 '13 at 09:06

1 Answer

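I don't know exactly how the inline assembly handles register preservation, but according to the ARM procedure call standard (ATPCS, now AAPCS), d8~d15 (that is, q4~q7) are callee-saved and have to be preserved before they are used. Your code uses q4 and q5, which fall in that range, so it's not very wise to use them unless absolutely necessary.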

Using them will either cost performance (if the compiler does a proper job and saves and restores them around the inline assembly), or it will do something 'unreasonable' such as corrupting values the caller still needs (if it doesn't).

Try to use q8~q13 instead. Those registers are caller-saved, so it would be a safe bet.
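Here is a minimal sketch of that change, not a verified fix for the original program. It assumes GCC-style inline assembly on 32-bit ARM with NEON. The matrix rows go into q8~q11 and the accumulator into q12. The vector stays in q0, because the by-scalar forms of vmul.f32/vmla.f32 only accept a scalar from d0~d15, and q0 (d0, d1) is caller-saved anyway. The "memory" clobber suggested in the comments is included as well. The plain C branch is a hypothetical fallback added only so the sketch builds and can be checked on other targets.

```c
/* Computes Vt = V * M (row vector times 4x4 row-major matrix),
 * the same product as the code in the question. */
void VecMatMult(float Vt[4], float M[4][4], float V[4])
{
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile(
        "vldmia %1, {q8-q11}        \n\t" /* matrix rows in q8..q11 (d16..d23) */
        "vldmia %2, {q0}            \n\t" /* vector in q0 (d0, d1): scalar operands must be in d0..d15 */
        "vmul.f32 q12, q8,  d0[0]   \n\t" /* q12  = row0 * V[0] */
        "vmla.f32 q12, q9,  d0[1]   \n\t" /* q12 += row1 * V[1] */
        "vmla.f32 q12, q10, d1[0]   \n\t" /* q12 += row2 * V[2] */
        "vmla.f32 q12, q11, d1[1]   \n\t" /* q12 += row3 * V[3] */
        "vstmia %0, {q12}           \n\t" /* store the result to Vt */
        :
        : "r"(Vt), "r"(M), "r"(V)
        : "q0", "q8", "q9", "q10", "q11", "q12", "memory"
    );
#else
    /* Hypothetical portable fallback (assumption, not from the thread). */
    float tmp[4];
    for (int j = 0; j < 4; ++j) {
        tmp[j] = V[0] * M[0][j] + V[1] * M[1][j]
               + V[2] * M[2][j] + V[3] * M[3][j];
    }
    for (int j = 0; j < 4; ++j) {
        Vt[j] = tmp[j];
    }
#endif
}
```

None of the registers listed in the clobber list fall in the callee-saved d8~d15 range, so the compiler has nothing extra to save and restore around the block.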

Jake 'Alquimista' LEE