
I am solving a 2D Laplace equation using OpenCL. The version that uses only global memory runs faster than the one that uses shared (in OpenCL terms, local) memory. The local-memory algorithm is the same as the one in the OpenCL Game of Life code:

https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/

If anyone has faced the same problem, please help. A sketch of the two kinds of kernels I mean is below; I can post my actual kernel if anyone wants to see it.
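For illustration, kernels of the kind being compared might look like the following (a hypothetical sketch, not my actual code; buffer names, a row-major layout, and a fixed-boundary Jacobi update are all assumptions):

```c
// Global-memory version: every work-item reads its four neighbours
// straight from global memory.
__kernel void laplace_global(__global const float *in,
                             __global float *out,
                             const int width, const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Update interior points only; boundary values stay fixed
    // (assumes 'out' was initialized with the boundary conditions).
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        int idx = y * width + x;
        out[idx] = 0.25f * (in[idx - 1] + in[idx + 1] +
                            in[idx - width] + in[idx + width]);
    }
}

// Local-memory version, in the style of the Game of Life tutorial:
// each work-group stages its tile plus a one-cell halo in local
// memory, then every work-item reads its neighbours from the tile.
__kernel void laplace_local(__global const float *in,
                            __global float *out,
                            const int width, const int height,
                            __local float *tile) // (local_w+2)*(local_h+2) floats
{
    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0) + 1, ly = get_local_id(1) + 1;
    int lw = get_local_size(0) + 2;  // tile row pitch including halo

    // Centre of the tile.
    if (gx < width && gy < height)
        tile[ly * lw + lx] = in[gy * width + gx];
    // Halo cells (the 5-point stencil does not need the corners).
    if (get_local_id(0) == 0 && gx > 0)
        tile[ly * lw] = in[gy * width + gx - 1];
    if (get_local_id(0) == get_local_size(0) - 1 && gx < width - 1)
        tile[ly * lw + lx + 1] = in[gy * width + gx + 1];
    if (get_local_id(1) == 0 && gy > 0)
        tile[lx] = in[(gy - 1) * width + gx];
    if (get_local_id(1) == get_local_size(1) - 1 && gy < height - 1)
        tile[(ly + 1) * lw + lx] = in[(gy + 1) * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
        out[gy * width + gx] = 0.25f * (tile[ly * lw + lx - 1] +
                                        tile[ly * lw + lx + 1] +
                                        tile[(ly - 1) * lw + lx] +
                                        tile[(ly + 1) * lw + lx]);
}
```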

Karthik G.M

3 Answers


If your global-memory version really runs faster than your local-memory version (assuming both are equally well optimized for the memory space they use), maybe this paper could answer your question.

Here's a summary of what it says:

Using local memory in a kernel adds another constraint on the number of concurrent work-groups that can run on the same compute unit.

Thus, in certain cases, it may be more efficient to remove this constraint and live with the higher latency of global memory accesses. More wavefronts (warps in NVIDIA parlance; each work-group is divided into wavefronts/warps) running on the same compute unit allow the GPU to hide latency better: while one waits for a memory access to complete, another can compute in the meantime.

In the end, each work-group will take more wall-clock time to finish, but the GPU stays completely busy because it is running more of them concurrently.
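One way to check whether this constraint is biting is to compare the kernel's local-memory footprint against what the device offers per compute unit. Here is a minimal host-side sketch (an illustration, not code from the question; it assumes `kernel` and `device` are a valid `cl_kernel` and `cl_device_id` from your existing setup code):

```c
#include <stdio.h>
#include <CL/cl.h>

void print_occupancy_limits(cl_kernel kernel, cl_device_id device)
{
    cl_ulong local_used, local_total;
    size_t   max_wg;

    // Local memory consumed by one work-group of this kernel.
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_used), &local_used, NULL);
    // Local memory available on each compute unit.
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_total), &local_total, NULL);
    // Largest work-group this kernel can be launched with.
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);

    printf("local mem per work-group:   %llu bytes\n",
           (unsigned long long)local_used);
    printf("local mem per compute unit: %llu bytes\n",
           (unsigned long long)local_total);
    if (local_used > 0)
        printf("work-groups resident per CU limited to ~%llu by local mem\n",
               (unsigned long long)(local_total / local_used));
    printf("max work-group size for this kernel: %zu\n", max_wg);
}
```

If the tile eats a large fraction of CL_DEVICE_LOCAL_MEM_SIZE, only a few work-groups can be resident on each compute unit at once, and latency hiding suffers exactly as described above.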

Simon

No, it doesn't. It only says that, ALL OTHER THINGS BEING EQUAL, an access to local memory is faster than an access to global memory. It seems to me that the global accesses in your kernel are being coalesced, which yields better performance.
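To see what coalescing means here, consider these two illustrative kernels (a sketch, not the asker's code; a square, row-major n-by-n grid is assumed). In the first, consecutive work-items touch consecutive addresses, which the hardware merges into a few wide memory transactions; the second breaks that pattern:

```c
__kernel void copy_coalesced(__global const float *in,
                             __global float *out,
                             const int n)
{
    int x = get_global_id(0);  // consecutive work-items -> consecutive addresses
    int y = get_global_id(1);
    out[y * n + x] = in[y * n + x];  // coalesced access
}

__kernel void copy_strided(__global const float *in,
                           __global float *out,
                           const int n)
{
    int x = get_global_id(0);  // consecutive work-items -> addresses n floats apart
    int y = get_global_id(1);
    out[x * n + y] = in[x * n + y];  // strided, poorly coalesced access
}
```

A stencil kernel indexed like the first version gets much of the benefit of local memory "for free" from coalescing (and, on newer hardware, from caches), which can explain the result you are seeing.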

Ani

Using shared memory (memory shared with the CPU) isn't always going to be faster. On a modern graphics card, it would only be faster when the GPU and CPU are both performing operations on the same data and need to share information with each other, since memory then wouldn't have to be copied from the card to the system and vice versa.

However, if your program runs entirely on the GPU, it could very well execute faster by working exclusively in the card's own memory (GDDR5), since the GPU's memory is not only likely to be much faster than your system's, but there is also no latency from reading memory over the PCI-E lanes.

Think of the graphics card's memory as a kind of "L3 cache" and your system's memory as a resource shared by the whole system, one you only use when multiple devices need to share information (or when your cache is full). I'm not a CUDA or OpenCL programmer; I've never even written Hello World with these APIs. I've only read a few white papers, but it's just common sense (or maybe my Computer Science degree is useful after all).