I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two kernels, one for each platform, identical except for the platform-specific keywords. They only read and write global memory, each thread accessing a different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the OpenCL configuration: a global work size of 50,000 and a local work size of 250.
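For concreteness, here is a minimal sketch of what such a kernel and launch might look like (the kernel body and all names are hypothetical, for illustration only), with the OpenCL mapping noted in comments:

```
#include <cuda_runtime.h>

// Hypothetical kernel matching the description: each thread reads and
// writes its own global-memory location.
__global__ void copyKernel(const double *in, double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
    out[i] = in[i];
}

int main()
{
    const int blocks = 200, threads = 250;          // 200 * 250 = 50,000 threads
    const size_t bytes = blocks * threads * sizeof(double);

    double *in, *out;                               // contents irrelevant here;
    cudaMalloc(&in,  bytes);                        // this only illustrates the
    cudaMalloc(&out, bytes);                        // launch shape

    // CUDA: 200 blocks of 250 threads each.
    // OpenCL equivalent: clEnqueueNDRangeKernel with
    //   global_work_size = 50000, local_work_size = 250.
    copyKernel<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```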

The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.

It would be great if you could suggest why I might be seeing this, and perhaps point out some differences between CUDA and OpenCL as implemented by NVIDIA.

user1096294
  • The results are not consistent across problems and cases, but yours may be right. OpenCL works asynchronously by default, so if you use CUDA as it is (not asynchronously), it will probably give slightly slower performance than OpenCL. – DarkZeros May 06 '14 at 15:45
  • I have already heard that NVIDIA's implementation of OCL is based on the CUDA one. However, each time I tried to find sources, I never found any evidence of that. I concluded that it is a mistake, based on the fact that CUDA is wrongly used as the language name, while in reality it is a "parallel computing platform and programming model" (Compute Unified Device Architecture), as per Wikipedia. Hence, when you see a slide from NVIDIA showing that under OCL there is something called CUDA, that's the GPU, which is a CUDA chip. Could you post your source if you have any? I'd like to know for sure. – CaptainObvious May 06 '14 at 16:23
  • OpenCL and CUDA are completely different. They both use the same HW in the end, but just as with OpenGL and DirectX, one is not layered under the other or vice versa. The main points supporting this are that the libraries are different, the compilers are different, and the execution model is different as well. Some parts might be common, but the majority is not. – DarkZeros May 06 '14 at 16:25
  • @DarkZeros I run OpenCL in synchronous mode. Also, I believe I have seen somewhere that NVIDIA's implementation of OpenCL does not allow asynchronous queues anyway. – user1096294 May 06 '14 at 18:07
  • Depends on what you mean by synchronous. Launching a kernel is always asynchronous in OCL; there is no way you can launch and wait for it. While in CUDA it is completely synchronous. The internal behavior is different. – DarkZeros May 07 '14 at 08:05
  • @DarkZeros You are wrong here; you can wait for the execution of your kernel launch to finish using events. I thought that by asynchronous you meant the queue to which I submit my jobs; these can be either in-order or out-of-order, the latter not supported by NVIDIA. – user1096294 May 07 '14 at 10:23
  • @user1096294 Even if you can wait for it, the launch itself is not synchronous. The next CPU instruction on the host will not be executed after the kernel has finished, but in parallel with it. I understand that you can emulate the behavior, but internally the execution model is different, thus producing small speed differences depending on the kernel/IO/etc. – DarkZeros May 07 '14 at 10:57
  • If you are on a 64-bit platform, my first guess would be that the OpenCL kernel is benefiting from lower register pressure, since it can be 32-bit. If the OpenCL toolchain permits, you should decompile the two and compare the microcode. – ArchaeaSoftware May 07 '14 at 15:16
  • @ArchaeaSoftware Can you elaborate on the 32-bit part? I am on a 64-bit platform, but I do calculations on double-precision numbers. – user1096294 May 07 '14 at 15:49
  • The NVIDIA OpenCL implementation is 32-bit and doesn't conform to the same function-call requirements as CUDA. CUDA runtime applications compile the kernel code to have the same bitness as the application. On a 64-bit platform, try compiling the CUDA application as a 32-bit application. Your use of double has nothing to do with the bitness of the application or kernel code. It is possible to get the PTX code from an OpenCL kernel so you can compare it against the CUDA code (a sketch of this follows these comments). At this time you cannot get the SASS code for OpenCL kernels. – Greg Smith May 08 '14 at 06:05
  • @GregSmith Please correct me if I get this wrong: you are saying that because I am on a 64-bit platform, when I compile my CUDA kernels some datatypes may be treated as larger than if I compiled on a 32-bit platform? My code uses integers for memory indices and doubles for data. Can pointer calculations take up more registers in the case of CUDA, increasing register pressure? On a side note, if NVIDIA OpenCL is 32-bit, will my OpenCL kernel be unable to access all the memory available on my Tesla K40? – user1096294 May 08 '14 at 19:40
  • Yes, when compiling for a 64-bit machine, the pointers are 64-bit, which effectively doubles the register usage (therefore possibly increasing register pressure) as compared to the same code compiled for 32-bit. – Robert Crovella May 09 '14 at 14:47
  • @RobertCrovella Thank you Robert. How do OpenCL kernels access more than 2^32 bytes of memory? – user1096294 May 09 '14 at 16:52
  • Are the numeric answers you get with OpenCL and CUDA identical? If not, then the kernels aren't doing the same computations. – Tim Child Feb 20 '15 at 22:42
  • I know that this question is already a year old and I am probably pointing out the obvious, but in case you are using the CUDA runtime API, beware that CUDA has to initialize the driver before running your kernel code, which may in turn skew your timings compared to OpenCL... Try running some dummy iterations of your kernel before doing the timings (see the timing sketch after these comments). – jcxz May 20 '15 at 18:49

1 Answer

Kernels executing on a modern GPU are almost never compute bound, and are almost always memory-bandwidth bound (because there are so many compute cores relative to the available path to memory).

This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.

In practice, this makes it very difficult to predict (or even understand) ahead of time what performance to expect.

The differences you observed are likely due to subtle differences in the memory access patterns of the two kernels, resulting from different optimizations made by the OpenCL and CUDA toolchains.

To learn how to optimize your GPU kernels, it pays to learn the details of the memory caching hardware available to you and how to use it to best advantage (e.g., making strategic use of "local" memory caches rather than always going directly to "global" memory in OpenCL).
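As an illustration (a toy CUDA sketch, not taken from the question's code), here is what staging data through on-chip shared memory, the CUDA analogue of OpenCL's __local address space, can look like:

```
// Each block stages its slice of global memory into fast on-chip
// shared memory (OpenCL: __local), then performs the awkward,
// uncoalesced access pattern (a reversal here) out of that cache
// instead of out of global memory.
__global__ void reverseBlock(const double *in, double *out)
{
    __shared__ double tile[250];                    // assumes 250 threads/block

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    tile[threadIdx.x] = in[g];                      // coalesced global read
    __syncthreads();                                // whole tile is ready

    out[g] = tile[blockDim.x - 1 - threadIdx.x];    // coalesced global write
}
```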

Shebla Tsama
  • Almost all my GPU kernels are compute bound; I don't know what you're doing. – étale-cohomology Jul 22 '21 at 23:57
  • Of course it's very normal to have compute-bound kernels, but as soon as you start accessing memory within your kernel (as in the above case), the physical path to memory quickly becomes saturated, due to the thousands of threads sending memory requests. This is why GPU makers focus so heavily on optimizing memory bandwidth hardware. – Shebla Tsama Nov 25 '21 at 04:27