
I'm writing an OpenCL program that, among other things, needs to check a calculated (int) value against a whitelist. My plan was to store the whitelist in constant or shared memory and then have each thread run a binary search using this shared whitelist.

Then I read about things like bank conflicts, where threads are slowed down because they access memory on the same bank, causing the accesses to be serialized.

Is a binary search going to suffer a larger performance loss in OpenCL due to issues like this? Would I be better off with some other search algorithm, such as a hash lookup?

Edit: Let me clarify my program a bit:

Each thread will do the same calculation in parallel, but with a different input value, so each thread will get a different output. Each output needs to be checked against the same whitelist.

The kernel will return a bool value that indicates the result of the search.

My concern is that, since each thread is doing an independent binary search, multiple threads will wind up accessing the same bank of the whitelist, causing a serial slowdown.

Xcelled
  • A good example of this kind of operation in OpenCL is a case where no computation is done at all (it is all memory operations), such that reducing the bandwidth demand on global memory can help a lot. -> http://stackoverflow.com/questions/20039890/opencl-shared-memory-among-tasks/20057224#20057224 – DarkZeros Jul 28 '14 at 12:07
  • I've done binary searches in OpenCL before; they are faster than a CPU implementation but not blazingly fast. There is some divergence in the conditional code, but it doesn't hurt much. – Dithermaster Jul 30 '14 at 14:48

3 Answers


If the pre-calculated list of ints is not changing (read-only data) and threads are not modifying it, then it is perfectly fine to search it with binary search without synchronization. Threads only become slow because of synchronization problems, and with read-only data there are none. A hash can sometimes be faster than binary search, but it is more useful for larger arrays. Try binary search first.

See also: Is it wise to access read-only data from multiple threads simultaneously?
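
A minimal kernel sketch of that approach (the sorted-ascending whitelist in `__constant` memory and all names here are my assumptions, not from the answer):

// Sketch: each work-item binary-searches the read-only whitelist for its
// own value; no synchronization is needed since nothing is ever written.
__kernel void binsearch_whitelist(__constant int *whitelist, const int n,
    __global const int *inputs, __global char *results)
{
    const int gid = get_global_id(0);
    const int value = inputs[gid]; // stand-in for the per-thread calculation

    int lo = 0, hi = n; // search the half-open range [lo, hi)
    bool found = false;
    while(lo < hi) {
        const int mid = lo + (hi - lo) / 2;
        if(whitelist[mid] == value) {
            found = true;
            break;
        }
        if(whitelist[mid] < value)
            lo = mid + 1; // look to the right
        else
            hi = mid; // look to the left
    }
    results[gid] = found; // 1 if whitelisted, 0 otherwise
}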

Baj Mile

If you're searching for a different item in each thread, then rather than worrying about bank conflicts, you need to worry about thread divergence, since binary search requires branching. Some of that can be mitigated using the select function.

You could be better off with another algorithm, such as interpolation search, which can find the item in fewer hops. Essentially, deciding where to look next is more expensive than in binary search, but if your search data is in global memory, you can hide a lot of that processing (about 20 instructions) under the memory latency.
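
For reference, the probe step of an interpolation search might look like this (a sketch of the general technique under my own assumptions, not code from the linked answer; it assumes int keys distributed roughly uniformly):

// Hypothetical probe step of an interpolation search: instead of always
// probing the midpoint, estimate where elem should sit from the key range.
__global const int *interpolation_probe(__global const int *begin,
    __global const int *end, const int elem)
{
    const long n = end - begin; // number of elements
    const long lo = *begin; // smallest key
    const long hi = *(end - 1); // largest key
    const long span = max(hi - lo, (long)1); // avoid division by zero
    long pos = ((long)elem - lo) * (n - 1) / span; // linear estimate
    pos = clamp(pos, (long)0, n - 1); // keep the probe in range
    return begin + pos;
}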

I was solving a similar problem recently: Binary search with hint.

A simplified algorithm looks like this:

// _TyIndex is a placeholder for the element type, e.g. typedef int _TyIndex;
__global const _TyIndex *upper_bound(__global const _TyIndex *begin,
    __global const _TyIndex *end, const _TyIndex elem)
{
    while(begin != end) {
        __global const _TyIndex *mid = begin + (end - begin) / 2;

#if 0
        if(!(elem < *mid))
            begin = mid + 1; // look to the right
        else
            end = mid; // look to the left
#else // 0
        // branch-free variant: for scalars, select(a, b, c) yields c ? b : a;
        // casting b_right gives the predicate the same width as the pointers
        intptr_t b_right = (intptr_t)!(elem < *mid);
        begin = (__global const _TyIndex *)select((intptr_t)begin, (intptr_t)(mid + 1), b_right);
        end = (__global const _TyIndex *)select((intptr_t)mid, (intptr_t)end, b_right); // c ? b : a
#endif // 0
    }

    return begin;
}

This uses select() twice rather than branching. You can compare the performance by changing the #if 0 to #if 1. Note that `e ? a : b` does imply a branch, so using that does not help.
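
For example, here is how the membership test itself might use it (a sketch; the typedef, kernel name, and argument names are my assumptions):

typedef int _TyIndex; // assumption: the placeholder element type is an int

// Sketch: each work-item checks whether its value occurs in the sorted
// whitelist, using the branch-free upper_bound() above.
__kernel void whitelist_check(__global const _TyIndex *whitelist, const int n,
    __global const _TyIndex *values, __global char *results)
{
    const int gid = get_global_id(0);
    const _TyIndex v = values[gid];
    // upper_bound() returns the first element greater than v, so v is in
    // the list exactly when the element immediately before that equals v
    __global const _TyIndex *it = upper_bound(whitelist, whitelist + n, v);
    results[gid] = (it != whitelist && *(it - 1) == v);
}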

the swine
  • Nice, this is what I was looking for. Just to make sure I understand: interpolation search is better than binary search on GPUs because the cost of a more complex prediction is less than the cost of accessing a global memory location. – Xcelled Nov 18 '14 at 07:10
  • Yes. The way GPUs work is also called "deep pipelines": executing an instruction is not done all at once (as is usual e.g. in RISC architectures); rather, there are many stages of execution, so a single instruction takes a very long time (many clock cycles) to execute. But that does not make it slow, because each stage of the pipeline can work on a different instruction (so throughput is the same; latency is longer). – the swine Nov 18 '14 at 10:53
  • By analogy, accessing memory takes a lot of time, but that does not make the GPU idle, as there are other functional units that can still execute code, there are potentially sleeping threads that can be woken up, etc. (although part of this is also out-of-order execution, not just deep pipelining). In effect, each memory access has the potential to hide some non-memory-access work (computation), so it would be a shame not to use it. – the swine Nov 18 '14 at 10:54
  • Care to provide some pseudocode for that optimised binary search using `select`? – Jonno_FTW Mar 08 '17 at 03:42
  • 1
    @Jonno_FTW there you go – the swine Mar 10 '17 at 15:21

It sounds like you are doing the binary search in each kernel as part of other work. It's better to find the result of the search first and pass it to the kernel as a parameter.

In general, binary search is an O(log n) algorithm and shouldn't be very slow unless you're looking for the value in a really large list. Still, it's a waste of resources to repeat the same search in each kernel. And if you want to parallelize the search itself, that is also inefficient, because you'd only have 2 cores executing kernels at each level/iteration of the algorithm; add to this the usual main program -> OpenCL overheads.

It would be better to go with a linear search, splitting the whitelist into as many parts as there are kernels. It might take longer to find the value in a single part of the list than with a binary search, but you'd have less overhead from passing values between the main program and the kernels.
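
For concreteness, that split linear search could look roughly like this (a sketch under my own assumptions: one shared target value, a found flag the host zeroes before launch, and illustrative names throughout):

// Sketch: each work-item scans its own slice of the whitelist and raises
// a shared flag on a hit; the host initializes *found to 0 beforehand.
__kernel void chunked_search(__global const int *whitelist, const int n,
    const int target, __global int *found)
{
    const int gid = (int)get_global_id(0);
    const int threads = (int)get_global_size(0);
    const int chunk = (n + threads - 1) / threads; // ceil(n / threads)
    const int begin = gid * chunk;
    const int end = min(begin + chunk, n);

    for(int i = begin; i < end; ++i) {
        if(whitelist[i] == target) {
            atomic_or(found, 1); // any hit marks the value as whitelisted
            break;
        }
    }
}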

stan0
  • I've clarified my question a bit, please see if that changes your answer. – Xcelled Jul 28 '14 at 13:35
  • @stan0 Well, formally, binary search is `O(log N)` and your linear search is `O(N / T)` where `T` is the number of threads. So it is really only faster if `T > N / log(N)`, and that is not counting the reduction of the results to get the final yes/no (or index). There was a paper by Merrill / Garland on parallel tree traversal on GPU; not sure, but I think that could be bent to perform e.g. parallel [interval halving](http://en.wikipedia.org/wiki/Bisection_method). – the swine Nov 17 '14 at 18:14