
Has anyone tried a custom branch-prediction algorithm for GPU computing in a ray-tracing collision-test kernel (CUDA, OpenCL)?

Should I even worry about performance at low trace depths (2-5)?

Example:

 trace the first group of rays
     check the previous ray's depth predictor:
         if it is zero, guess zero
         if it is greater than one, guess depth >= 1
     go one level deeper into the tracing kernel (with a pseudo-stack and recursion)
     recursively repeat
     return from one depth level after saving the guess state
     recursively return from all depth levels

Can this beat hardware-level prediction? Could this even reduce total tracing time?

The "if" statements in this pseudocode shouldn't contain any actual "if". The kernel just computes zero or the actual value depending on the prediction value.

Thanks.

huseyin tugrul buyukisik

1 Answer


This is from the OpenCL Optimization Manual:

Use predication rather than control-flow. The predication allows the GPU to execute both paths of execution in parallel, which can be faster than attempting to minimize the work through clever control-flow. The reason for this is that if no memory operation exists in a ?: operator (also called a ternary operator), this operation is translated into a single cmov_logical instruction, which is executed in a single cycle. An example of this is:

if (A > B) { C += D; } else { C -= D; }

Replace this with:

int factor = (A > B) ? 1 : -1;
C += factor * D;

In the first block of code, this translates into an IF/ELSE/ENDIF sequence of conditional code, each taking ~8 cycles. If divergent, this code executes in ~36 clocks; otherwise, in ~28 clocks. A branch not taken costs four cycles (one instruction slot); a branch taken adds four slots of latency to fetch instructions from the instruction cache, for a total of 16 clocks. Since the execution mask is saved, then modified, then restored for the branch, ~12 clocks are added when divergent, ~8 clocks when not.

In the second block of code, the ?: operator executes in the vector units, so no extra CF instructions are generated. Since the instructions are sequentially dependent, this block of code executes in 12 cycles, for a 1.3x speed improvement. To see this, the first cycle is the (A>B) comparison, the result of which is input to the second cycle, which is the cmov_logical factor, bool, 1, -1. The final cycle is a MAD instruction that: mad C, factor, D, C. If the ratio between conditional code and ALU instructions is low, this is a good pattern to remove the control flow.

Seems like your 0/1 selection could be based on prediction. Not sure if this is what you are looking for.

Source: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/

Austin