I'm using OpenCL (via JOCL) to find minima in a bunch of distance calculations for ray marching. The pseudocode would look something like this:

Start with a point in 3d space.
There are a number of functions to calculate distances
    to that point from various other points. 
    These may be rather complex (transforms, csg etc).
Calculate all of the distances, perhaps into an array
Get the index of the minimum distance in the array.
Use that index to do other stuff (pigmentation etc).

My implementation is kinda crap though. I don't currently parallelize the distance calculations but I would like to. Here's why I don't:

It's easy enough to get the minimum distance, but to retrieve this index is not obvious. I ended up iterating over the distances and keeping track of the current minimum and its index, but this is obviously garbage in a parallel environment.

I could use a tip to steer me in the right direction here, or tell me if I'm barking up the wrong tree entirely (e.g. is this a CPU job?).

Thanks!

Ben Hardy
  • Thread 1 could compare the distances in two pairs, and thread 2 could do the same for another two pairs. Then, after a sync between the two threads, a single thread could compare the two results and pick the min (or max) one. If the size is N, the number of syncs could be log N and the maximum number of threads in the first step could be N/4. It looks like a reduction: with N/4 cores, only log N sync points (iterations, pseudo-cycles) are needed. – huseyin tugrul buyukisik Mar 22 '17 at 14:38
  • But if the arrays are not sorted, brute-force O(n^2) checks may be needed, so it would need N*log(N) pseudo-cycles on an N/4-core system. – huseyin tugrul buyukisik Mar 23 '17 at 15:52
  • @huseyintugrulbuyukisik thanks, i'll give that a try. it is a reduction so sorting not required. just picking the min and its index. thanks! – Ben Hardy Mar 23 '17 at 16:51
  • It might be possible to tweak a simple "reduction" kernel (like the one at http://jocl.org/samples/reduction.cl ) to not perform a `+`-reduction but a `min`-reduction, and to not return the *value* but the *index*. Did you consider something like this? (I'd try it out, but am short on time right now) – Marco13 Mar 28 '17 at 11:39

1 Answer

Tested with an RX550, which is a low-end graphics card.

1-million element min() function:

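// One pass of a min-reduction that also carries the index of each minimum.
// Enqueue repeatedly with global size = N/2, N/4, ..., 1.
__kernel void test(__global float * data, __global int * index)
{
    int id = get_global_id(0);
    int stride = get_global_size(0);
    float d1 = data[id];
    float d2 = data[id + stride];
    // select(a, b, c) returns b when c is true: keep index[id] if d1 is the minimum.
    index[id] = select(index[id + stride], index[id], d1 <= d2);
    data[id] = fmin(d1, d2);
}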

The data elements were initialized with random values, and each index element with its own position in the array.

Uploading data and index to the GPU through PCIe 2.0 x8 took: 3.0 ms

Computing with global range = 512k, 256k, 128k, ..., 1 (log N steps) took: 0.3 ms

Downloading data[0] and index[0] took: 0.002 ms
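
The multi-pass scheme timed above can be emulated on the CPU to check the indexing logic before running it on a GPU. What follows is a minimal sketch in plain C, not OpenCL: `min_pass` and `argmin_by_halving` are hypothetical names, each loop iteration stands in for one work-item, and the element count is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* One emulated kernel pass (hypothetical helper): "work-item" id compares
   slot id with slot id + range and keeps the smaller value in data[id],
   together with that value's original position in index[id]. */
static void min_pass(float *data, int *index, size_t range)
{
    for (size_t id = 0; id < range; ++id) {   /* one iteration = one work-item */
        float d1 = data[id];
        float d2 = data[id + range];
        if (d2 < d1) {                        /* on a tie the lower slot wins */
            data[id] = d2;
            index[id] = index[id + range];
        }
    }
}

/* Host-side loop: halve the range on every pass until one element is left.
   Assumption: n is a power of two. data and index are overwritten. */
static int argmin_by_halving(float *data, int *index, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t i = 0; i < n; ++i)
        index[i] = (int)i;                    /* each element starts as its own index */
    for (size_t range = n / 2; range >= 1; range /= 2)
        min_pass(data, index, range);
    return index[0];                          /* data[0] now holds the minimum */
}
```

On the device, each call to `min_pass` would correspond to one kernel enqueue with that global size, and only `data[0]` and `index[0]` need to be read back at the end.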

This is a straightforward version and may not be the fastest implementation. To speed it up, a work-group-level sub-reduction can be added with:

  • work_group_scan_inclusive_min(x) for OpenCL 2.0+
  • __local float reductionArray[256] for OpenCL 1.2 and earlier

This cuts the number of kernel enqueue commands and might finish the job in under a hundred microseconds.
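
To illustrate the work-group idea, here is a rough CPU emulation in plain C, again not OpenCL and not a measured implementation. Each "group" copies 256 values into scratch arrays that stand in for `__local` memory, tree-reduces them, and emits one (value, index) pair. The names `group_min` and `argmin_by_groups` are hypothetical, and padding an incomplete last group with a copy of its first element is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 256   /* same size as the reductionArray[256] mentioned above */

/* Emulates one work-group: load its slice into "local" scratch arrays,
   tree-reduce them, and report the group's minimum and its global index. */
static void group_min(const float *data, size_t group, size_t n,
                      float *out_value, int *out_index)
{
    float local_val[GROUP_SIZE];
    int   local_idx[GROUP_SIZE];
    size_t base = group * GROUP_SIZE;

    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        size_t gid = base + lid;
        /* Past-the-end slots reuse the group's first element, so padding
           never introduces a value or index that is not in the data. */
        size_t src = gid < n ? gid : base;
        local_val[lid] = data[src];
        local_idx[lid] = (int)src;
    }
    /* On a GPU, barrier(CLK_LOCAL_MEM_FENCE) would separate these steps. */
    for (size_t stride = GROUP_SIZE / 2; stride >= 1; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            if (local_val[lid + stride] < local_val[lid]) {
                local_val[lid] = local_val[lid + stride];
                local_idx[lid] = local_idx[lid + stride];
            }
        }
    }
    *out_value = local_val[0];
    *out_index = local_idx[0];
}

/* Reduces n values to ceil(n / GROUP_SIZE) partial results, then picks the
   best partial result sequentially. Returns the index of the minimum. */
static int argmin_by_groups(const float *data, size_t n)
{
    assert(n > 0);
    size_t groups = (n + GROUP_SIZE - 1) / GROUP_SIZE;
    float best_val = 0.0f;
    int best_idx = -1;
    for (size_t g = 0; g < groups; ++g) {
        float v;
        int i;
        group_min(data, g, n, &v, &i);
        if (best_idx < 0 || v < best_val) {
            best_val = v;
            best_idx = i;
        }
    }
    return best_idx;
}
```

With 256 elements per group, a million inputs shrink to about four thousand partial results after a single pass, which is why this kind of sub-reduction needs far fewer enqueues than halving the global range twenty times.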

huseyin tugrul buyukisik