I am trying to optimize a raytracer on a BeagleBoard, and for that I am using the NEON coprocessor. The raytracer calls a matrix multiplication function many times, so I have rewritten that function in inline assembly. However, the results it produces are not accurate. Here is my code:
void VecMatMult(float Vt[4], float M[4][4], float V[4])
{
    __asm__ volatile(
        "vldmia %1, {q1-q4} \n\t"       // Load the matrix into the quad registers
        "vldmia %2, {q5} \n\t"          // Load the vector
        "vmul.f32 q0, q1, d10[0] \n\t"  // Calculate the matrix product
        "vmla.f32 q0, q2, d10[1] \n\t"
        "vmla.f32 q0, q3, d11[0] \n\t"
        "vmla.f32 q0, q4, d11[1] \n\t"
        "vstmia %0, {q0} \n\t"          // Store the output
        :
        : "r" (Vt), "r" (M), "r" (V)
        : "q0", "q1", "q2", "q3", "q4", "q5"
    );
}
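For reference, here is a plain C sketch of what the assembly is meant to compute, as far as I can tell from the instruction sequence: each q register holds one row of M, and row i is scaled by V[i] and accumulated, so Vt[j] is the sum over i of V[i] * M[i][j]. The name VecMatMultRef is only for illustration and is not part of my actual program; the temporary array is there so that Vt may alias V, as it can in the assembly version.

```c
/* Scalar reference: Vt[j] = sum over i of V[i] * M[i][j].
 * Hypothetical helper for comparison only, not the real code path. */
void VecMatMultRef(float Vt[4], float M[4][4], float V[4])
{
    float tmp[4];
    int i, j;

    for (j = 0; j < 4; j++) {
        /* Same order as the assembly: start with row 0, then accumulate. */
        tmp[j] = M[0][j] * V[0];
        for (i = 1; i < 4; i++) {
            tmp[j] += M[i][j] * V[i];
        }
    }

    /* Copy out last, so the result is still correct if Vt and V overlap. */
    for (j = 0; j < 4; j++) {
        Vt[j] = tmp[j];
    }
}
```

Comparing the output of the two functions on the same inputs inside the main program should show which calls go wrong.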
The odd thing is that when I call this function from a separate test program, the results are correct. However, when it is called repeatedly in my main program, the results are wrong. Any help would be appreciated, as I am out of ideas at the moment.