
Is there an initial performance hit when using local memory? I was converting my existing kernel that uses global memory, and after a successful conversion I saw the performance degrade. Obviously you may think I did not use it correctly, and I might even agree and find more optimizations. But that is not the question here.

As a side experiment I ran the same kernel using global memory as-is, with no access to local memory. Then all I did was pass in a kernel parameter with local memory, some 1024 integers, and this kernel's execution took almost twice as long. So does the allocation of local memory itself cause some initial performance hit? Has anybody seen this, and maybe have an explanation?
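For reference, the mechanism I'm describing works roughly like this (argument index, names, and sizes are illustrative, not my actual code):

```c
// Host side: a __local buffer is "allocated" by passing only a size,
// with a NULL data pointer, to clSetKernelArg:
clSetKernelArg(kernel, 1, 1024 * sizeof(cl_int), NULL);
```

```c
// Kernel side: the matching parameter is declared __local.
__kernel void my_kernel(__global const int *in, __local int *scratch)
{
    // scratch points at 1024 ints of per-work-group local memory,
    // even if the kernel body never touches it.
}
```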

[UPDATE] Thank you all for your comments and answers. I tried to write a separate test kernel to see if this behavior was repeatable. It wasn't. I found a post, Is private memory slower than local memory?, which mentions that excess use of private memory may spill over to global memory and as a result may slow down kernel execution. It seems this may be specific to NVIDIA cards; I wonder what happens on AMD cards. Could it be that allocating local memory suddenly caused the private memory to spill over, to make space for the local memory? I am looking at my implementation from that angle now, unless any of you suggest otherwise. Is there any documentation or book you are aware of that mentions this?
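One way to check the spill hypothesis without guessing is to query the compiled kernel's private memory usage (standard OpenCL API; `kernel` and `device` are assumed to be valid objects):

```c
// How much private memory (per work item) the compiled kernel uses;
// a large value here suggests register pressure and possible spilling.
cl_ulong private_mem = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem), &private_mem, NULL);
```

Comparing this value for the two kernel variants (with and without the __local argument) would show whether adding local memory changed the private memory usage.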

Thanks again.

  • How are you using global memory? How have you set up your work groups to share local memory? Are you using MEM_FENCE commands? Have you accidentally coded the kernel so that all threads are copying global to local, instead of one work item moving one value from global to local? – Austin Mar 03 '14 at 04:55
  • @Austin, at this point my question is not about using local memory, although it is possible I may need help at a later stage. It's more about creating a local memory buffer and seeing a slowdown (2nd paragraph in my question). In my experiment I used only global memory and have not referenced the local memory in any way. I create the local memory buffer by passing it as a kernel parameter, and just by doing that the kernel takes twice as long. Have you seen such behavior? Is there any explanation why? Thanks. – user3371762 Mar 03 '14 at 06:22
  • What hardware are you using and what is the local work size you are using? It is possible that using local memory reduces the number of workgroups that can be scheduled to a compute unit as they all have to fit within the available local memory. – chippies Mar 03 '14 at 06:55
  • @chippies, I have an NVIDIA GTX 260. It reports 16k of local memory. Talking specifically about my experiment (2nd paragraph in my question), I did not schedule any workgroup, although at this point I do not remember if I set the local size to 1 or to NULL (and accidentally let OpenCL decide the workgroup size). I'll have to rerun using both values and see if there's any difference. Thanks. – user3371762 Mar 03 '14 at 07:17
  • Depends on the algorithm; not all processes can be sped up with local memory. In fact, some kernels may do better using global memory instead. – DarkZeros Mar 03 '14 at 10:16
  • It is really hard to tell without looking at your code. If you are positive that the code is otherwise similar, I'd guess that you are doing multiple copies from global to local and that is causing the slowdown. – Austin Mar 03 '14 at 16:32
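The copy pattern Austin describes in the comments above, where each work item moves exactly one value from global to local memory before a barrier, is sketched below (kernel name and arguments are illustrative):

```c
__kernel void tiled(__global const int *in,
                    __global int *out,
                    __local  int *tile)
{
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    // One value per work item: global -> local.
    tile[lid] = in[gid];

    // The whole work group must synchronize before anyone reads the tile.
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[lid];
}
```

Without the barrier, or with every work item copying the full tile, the kernel can easily end up slower than the pure global-memory version.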

1 Answer


A performance hit may be imposed by using a local work group of non-optimal size, or by synchronization of work items (WI) within a work group (WG).

Reading into local memory itself doesn't introduce any performance hit; it is of the same order of speed as reading into private memory (both are placed on-chip).

Also, check that your data fits into the available local memory, as it is usually small.
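Both the device's local memory capacity and the kernel's actual local memory usage can be queried (standard OpenCL API; `kernel` and `device` are assumed to be valid objects):

```c
// Total local memory available per compute unit on this device.
cl_ulong local_mem_size = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);

// Local memory this kernel consumes per work group
// (includes __local arguments set via clSetKernelArg).
cl_ulong kernel_local = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kernel_local), &kernel_local, NULL);
```

If kernel_local is a large fraction of local_mem_size, fewer work groups can be resident per compute unit at once, which reduces occupancy.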

Roman Arzumanyan
  • So how do you determine the optimal size? Would it just be a power of 2, or a multiple of something? – user3371762 Mar 04 '14 at 05:41
  • It depends on the hardware architecture. Usually a GPU is composed of clusters (compute units), each containing stream processors (ALUs). A work-group size equal to, or a multiple of, the number of such processors is optimal, as the work items then share resources on the same piece of the chip. E.g. for AMD cards with the VLIW architecture the optimal size is a multiple of 64. – Roman Arzumanyan Mar 04 '14 at 07:26
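Rather than hard-coding such a multiple, it can be queried per kernel (standard OpenCL 1.1+ API; `kernel` and `device` are assumed to be valid objects):

```c
// Hardware-preferred work-group size multiple for this kernel,
// typically the warp/wavefront width (e.g. 32 on NVIDIA, 64 on AMD).
size_t preferred = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred), &preferred, NULL);
```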