
I need to do many comparisons in an OpenCL program. Right now I do it like this:

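// Returns 1 if the first `size` bytes of a and b are equal, 0 otherwise
// (note: this is the opposite convention to the C library memcmp).
// `size` is a plain value parameter, so it must not carry the __global qualifier.
int memcmp(__global unsigned char* a, __global unsigned char* b, int size){
    for (int i = 0; i < size; i++){
        if (a[i] != b[i]) return 0;
    }
    return 1;
}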

How can I make it faster? Maybe by using vectors like uchar4, or something else? Thanks!

lebron2323

1 Answer


I assume your kernel has each thread (work-item) compare "size" elements. I think your code can improve if your memory accesses are better coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can still cause a noticeable performance penalty. For example, say you have 4 threads and size = 128, so each buffer holds 512 uchars. In your case, thread #0 accesses a[0] and b[0], which brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, and so on. After thread #3 the whole buffer is in the cache. With a domain this small, that is not a problem.

However, if the threads access consecutive elements at each step (thread #0 reads a[0], thread #1 reads a[1], and so on), only one "cache line" is needed at any given time for all 4 threads (the accesses are coalesced). The benefit grows as more threads per block are used. Please try it and tell me what you find. Thank you.
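As a rough illustration of the interleaved indexing described above, here is a minimal sketch in plain C that emulates the work-items on the host. It assumes the work-items cooperate on one comparison rather than each owning a contiguous block. The names `compare_strided` and `compare_all` are hypothetical; `gid` stands in for `get_global_id(0)` and `num_items` for `get_global_size(0)` in a real kernel.

```c
#include <stddef.h>

/* Hypothetical helper: the share of the comparison done by one work-item.
   At step k, work-item gid touches index k * num_items + gid, so
   neighbouring work-items touch neighbouring bytes. */
int compare_strided(const unsigned char *a, const unsigned char *b,
                    size_t total, size_t gid, size_t num_items)
{
    for (size_t i = gid; i < total; i += num_items) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}

/* Host-side emulation only: run every work-item in turn and combine
   the partial results. On a GPU this loop is the parallel launch. */
int compare_all(const unsigned char *a, const unsigned char *b,
                size_t total, size_t num_items)
{
    int equal = 1;
    for (size_t gid = 0; gid < num_items; gid++) {
        equal &= compare_strided(a, b, total, gid, num_items);
    }
    return equal;
}
```

On the device the partial results would still have to be combined somehow (for example through a flag in global memory); that part is left out of this sketch.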

See Section 3.1.2.1 of http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf. The guide is a bit old, but the concepts still apply.

PS: By the way, after this I would also try uchar4, as you suggested, and loop unrolling.
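To give an idea of what comparing 4 bytes at a time looks like, here is a minimal sketch in plain C, where a 32-bit word plays the role that a `uchar4` load would play in the kernel. The function name `compare_words` is hypothetical, and whether this is actually faster on a given GPU is something to measure, not something this sketch shows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first `size` bytes of a and b are equal, 0 otherwise
   (same convention as the function in the question). */
int compare_words(const unsigned char *a, const unsigned char *b, size_t size)
{
    size_t i = 0;

    /* Main loop: compare 4 bytes per iteration. */
    for (; i + 4 <= size; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);  /* memcpy sidesteps alignment issues */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb) return 0;
    }

    /* Tail: the remaining 0 to 3 bytes, one at a time. */
    for (; i < size; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
```

The tail loop matters whenever size is not a multiple of 4; unrolling the main loop further (for example two words per iteration) follows the same shape.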

Moises Viñas