Arm Mali T-624 STUCK EXECUTION TIME IN 12666 ms

Question

I am using this GPU for my university thesis.

I am running many different kernels on it, and the execution time is stuck at 12666.6689 ms, even when the loop has 88 instructions * 100m iterations.

__kernel void scalar_mult_add(__global  int * list)
{
    unsigned int x=38;
    unsigned int y=38;
    for(int i=0; i<1000000  ; i++){
        y=x*y;
        x=x+y;
    }
}

The only thing that increases the execution time is adding x!=0 to the for loop condition:

__kernel void scalar_mult_add(__global  int * list)
{
    unsigned int x=38;
    unsigned int y=38;
    for(int i=0; i<1000000  && x!=0 ; i++){
        y=x*y;
        x=x+y;
    }
}

Why does this happen every time? I can't understand it: e.g. 88 million instructions take the same execution time as 1 million instructions, even though the GPU doesn't have enough units to run such a big kernel in the same time as one with 1 million instructions.

Why does adding a single x!=0 condition to the loop increase the execution time that much, while a couple of extra additions inside the for loop do not?

The code has constant input and produces no output, so the compiler optimiser can probably eliminate most if not all of your code — Alan Birtles, Nov 20 '21 at 16:50
Have you examined the actual compiler output? What optimization level do you have turned on? The first example you have can nearly be optimized away because the loop doesn't really "do anything". Try looking at the actual ASM output to see. — Ron Beyer, Nov 20 '21 at 16:50
You seem to have compiler optimizations disabled. [If you observe the generated assembly](https://godbolt.org/z/34orM3TKK) (On a CPU, admittedly, because that's what I have handy) both of your functions can be optimized away to _nothing_. Because both functions have no observable effect. — Drew Dormann, Nov 20 '21 at 16:54
Yes, I do have "-cl-opt-disable" in my cpp file. Would you recommend another optimization flag? I don't need the kernel's results because I just want to validate the GPU units and pipeline depth, and to check which units (vector and scalar) are available. How can I stop the compiler from eliminating code that has no effect? — marios, Nov 20 '21 at 20:45
The Mali compiler doesn't have an option to disable optimization. Change your kernel to keep the outputs, or have a side-effect on the loop value during calculation. — solidpixel, Nov 22 '21 at 07:50

solidpixel · Answer 1 · 2021-11-22T14:45:21.823

Why does adding a single x!=0 statement in the loop make the execution-time increase?

In the original case the loop doesn't do anything - the output isn't kept and the loop result is not used in any further computation. As is noted in the comments above, the compiler is probably just optimizing out the loop.

Adding the x != 0 to the loop condition check means that the loop result is "used" - you need the previous iteration loop result to determine if you keep iterating. The code is still pointless (no output) but the compiler doesn't see this as dead-code, so it stays in.
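
As a minimal sketch of "keeping the outputs" (not a verified Mali benchmark), the kernel below reads its starting value from the buffer and writes the loop result back, so the input is no longer a compile-time constant and the result is observable. The `#ifndef __OPENCL_VERSION__` part is a hypothetical host-only shim, including the made-up `host_gid` variable, that lets the same source be sanity-checked with an ordinary C compiler. An OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (assumption, not part of OpenCL): lets a plain C
   compiler build the kernel below for a quick sanity check. */
#include <stddef.h>
#define __kernel
#define __global
static size_t host_gid; /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
#endif

__kernel void scalar_mult_add(__global unsigned int *list)
{
    size_t gid = get_global_id(0);

    /* Start from a runtime value so the loop cannot be constant-folded. */
    unsigned int x = list[gid];
    unsigned int y = 38;

    for (int i = 0; i < 1000000; i++) {
        y = x * y;
        x = x + y;
    }

    /* Store the result so the loop is no longer dead code. */
    list[gid] = x;
}
```

The host has to fill `list` before the launch and can read it back afterwards. Even with the store in place, it is worth confirming from the compiler output that the loop is still present.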

However, note that Mali-T624 has a SIMD vector unit. Writing a dependent scalar loop like this means that you are effectively killing any auto-vectorization in the compiler. I highly recommend using vec4 data types for the computation.
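
A hedged sketch of the same loop on a vec4 type, here OpenCL's built-in `uint4`: each work-item carries four independent lanes through the multiply-add chain and stores the result so the work is kept. The kernel name `vector_mult_add` and the host-only shim, which relies on the GCC/Clang vector extension and a made-up `host_gid` variable, are assumptions for illustration. Whether this actually fills the T624 vector unit should be checked against the compiler output.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (assumption, GCC/Clang vector extension): lets a
   plain C compiler build the kernel below for a quick sanity check. */
#include <stddef.h>
#define __kernel
#define __global
typedef unsigned int uint4 __attribute__((vector_size(16)));
static size_t host_gid; /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_gid; }
#endif

__kernel void vector_mult_add(__global uint4 *list)
{
    size_t gid = get_global_id(0);

    /* Four lanes per work-item, loaded at runtime from the buffer. */
    uint4 x = list[gid];
    uint4 y = x;

    for (int i = 0; i < 1000000; i++) {
        y = x * y; /* element-wise multiply */
        x = x + y; /* element-wise add */
    }

    /* Keep the result so the loop cannot be eliminated. */
    list[gid] = x;
}
```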

If you want some basic static analysis you might want to look at the Mali Offline Compiler, which is freely downloadable in Arm Mobile Studio. Note that compiling OpenCL kernels requires macOS or Linux, but if you are on Windows you can run the Linux binary under WSL.

Thanks for your answer @solidpixel, but I use the OpenCL flag -cl-opt-disable. Does this flag help with my dead code problem? (I mean, does it tell the compiler not to remove the dead code?) — marios, Nov 22 '21 at 12:43
You have absolutely no way to tell. Relying on something you can't verify for a thesis project seems unwise. Write a kernel that does something sensible and you'll get sensible results. — solidpixel, Nov 22 '21 at 14:44
