Is private memory slower than local memory?

Question

I was working on a kernel that made many global memory accesses per thread, so I copied the data to local memory, which gave a 40% speed-up.

I wanted more speed-up, so I copied the data from local to private memory, but that degraded performance.

So is it correct to conclude that we should not use too much private memory, because it may degrade performance?

A minimal runnable example and exact platform specs would be awesome :-) — Ciro Santilli OurBigBook.com, Mar 14 '17 at 08:39

score 15 · Answer 1 · answered Nov 06 '12 at 15:26

Ashwin's answer is in the right direction but a little misleading.

OpenCL abstracts the address space of variables away from their physical storage, and there is not necessarily a 1:1 mapping between the two.

Consider OpenCL variables declared in the __private address space, which includes automatic non-pointer variables inside functions by default. The NVIDIA GPU implementation will physically allocate these in registers as far as possible, only spilling over to physical off-chip memory when there is insufficient register capacity. This particular off-chip memory is called "CUDA local" memory, and has similar performance characteristics to memory allocated for __global variables, which explains the performance penalty due to register spill-over. There is no such physical thing as "private memory" in this implementation, only a "private address space", which may be allocated on- or off-chip.

The performance hit is not a direct consequence of using the private address space (or "private memory"), which is typically allocated in high performance memory. It is because, under this implementation, the variable was too large to be allocated on high performance registers, and was therefore "spilled over" to off-chip memory.
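As a rough mental model of the spill-over described above, the sketch below splits one work-item's private data between registers and off-chip storage. It is a toy calculation, not how any real compiler allocates registers: the function name `place_private_data`, the 4-byte register width and the budget of 63 registers per work-item are all illustrative assumptions.

```python
def place_private_data(private_bytes, register_budget=63, bytes_per_register=4):
    """Toy model of where one work-item's __private data ends up.

    register_budget and bytes_per_register are assumed values for
    illustration; a real compiler decides this per variable.
    """
    # Registers needed to hold all private data (ceiling division).
    needed = -(-private_bytes // bytes_per_register)
    # Whatever fits stays on-chip; the remainder spills off-chip.
    in_registers = min(needed, register_budget)
    spilled = needed - in_registers
    return {"in_registers": in_registers, "spilled": spilled}


# A few scalar temporaries (16 bytes) fit entirely in registers.
small = place_private_data(16)
# A 1 KiB private array exceeds the assumed budget, so most of it spills.
large = place_private_data(1024)
print(small, large)
```

In this model only the second case touches off-chip memory, which matches the point above: the slowdown comes from the size of the private data, not from the private address space itself.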

score 1 · Answer 2 · answered Jan 12 '17 at 00:12

(I know this is an old question, but the answers given aren't very accurate, and I saw conflicting answers elsewhere during Google searches.)

According to "Heterogeneous Computing with OpenCL" (Revised OpenCL 1.2 Edition):

Private memory is memory that is unique to an individual work-item. Local variables and nonpointer kernel arguments are private by default. In practice, these variables are usually mapped to registers, although private arrays and any spilled registers are usually mapped to an off-chip (i.e., long-latency) memory.

So, if you use a great deal of private memory, or use arrays in private memory, yes, it can be slower than local memory.

huseyin tugrul buyukisik · Answer 3 · 2017-02-08T12:56:16.203

James Beilby's answer is in the right direction but slightly off the mark:

Depending on the implementation, private memory can be faster or slower, because OpenCL does not force vendors to use on-chip or off-chip memory for it. AMD does well on OpenCL price/performance, so I'll give some numbers for its implementation.

In AMD's implementation, private memory is the fastest (lowest latency and highest bandwidth, around 22 TB/s for a mainstream GPU).

See Appendix D here:

http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

There you can see the register file, LDS, constant cache and global memory, each of which backs a different address space as long as there is enough room. For example, the register file has 22 TB/s of bandwidth and only about 300 kB of capacity per compute unit. It has lower latency and more bandwidth than the LDS, which is used for the __local address space. The total LDS size per compute unit is even smaller than that.

If going from local to private doesn't help, try decreasing the local work-group size, for example from 256 to 64, so that more private registers are available per thread.

So for this example AMD GPU, local memory is 15 times faster than global memory, and private memory is 5 times faster than local memory. If the data doesn't fit in private memory, it spills to global memory, so only the L1/L2 caches can help. If the data is not re-used much, there is no point in using private registers; if it is used only once, just stream from global to global.
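To put the ratios above on one scale, the following sketch derives the bandwidths they imply. It only chains the figures quoted in this answer (22 TB/s for the register file, 5x and 15x for the two steps); the derived LDS and global values are back-of-the-envelope estimates, not numbers taken from the AMD guide, and the variable names are made up for illustration.

```python
# Figures quoted in the answer above.
REGISTER_TBPS = 22.0       # register file (private) bandwidth
PRIVATE_OVER_LOCAL = 5.0   # private is ~5x faster than local
LOCAL_OVER_GLOBAL = 15.0   # local is ~15x faster than global

# Implied bandwidths, obtained by dividing down the chain.
local_tbps = REGISTER_TBPS / PRIVATE_OVER_LOCAL
global_tbps = local_tbps / LOCAL_OVER_GLOBAL
private_over_global = PRIVATE_OVER_LOCAL * LOCAL_OVER_GLOBAL

print(f"local  ~ {local_tbps:.1f} TB/s")
print(f"global ~ {global_tbps * 1000:.0f} GB/s")
print(f"private is ~{private_over_global:.0f}x faster than global")
```

Taken at face value, these figures put private data that spills to global memory on a path roughly 75 times slower than the register file, which would explain a kernel getting slower once its private data stops fitting in registers.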

On some smartphones or on a CPU, using private memory could perform very badly, because it may be mapped to something other than registers.

score 0 · Answer 4 · answered Mar 27 '12 at 08:54

In (GPU-like) OpenCL devices, the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache. The private memory for each thread is actually apportioned from off-chip global memory. This is far from the PE and might have a latency of hundreds of clock cycles, thus degrading the read-write performance.

Ashwin Nanjappa

Suppose a thread uses a global buffer value twice; then initially keeping that value in private memory (a register) saves one global access, which improves performance. In this sense I am confused. Waiting for your reply. – Megharaj Mar 27 '12 at 09:34
Megharaj: Note that private memory is not the same as registers. Although, if there is register spilling, private memory might be used for that. – Ashwin Nanjappa Mar 27 '12 at 09:38
Thanks a lot for your answer (danyavadagalu). "This is far from the PE and might have a latency of hundreds of clock cycles, thus degrading the read-write performance." Then why does it improve performance in the case below? "Suppose a thread uses a global buffer value twice; then initially keeping that value in private memory saves one global access, which improves performance." So to what extent should we use private memory? Thank you. – Megharaj Mar 27 '12 at 09:43
"the local memory is on-chip and close to the processing elements (PE). It might be as fast as accessing L1 cache" Thanks a lot for this, it makes sense. I'll look into it. Thanks again :) – Megharaj Mar 27 '12 at 09:46
Sorry to interrupt again. I have written a median filter that is faster than NVIDIA's. In my program I copy from global memory to local memory (efficiently :)). I don't want to change the local buffer values, so I copy them to private memory and perform the operations there, which gives good performance. Using only local memory, without copying to private, degrades performance. – Megharaj Mar 27 '12 at 09:52
Megharaj: I do not know the answer to that. Maybe you can post a different question on StackOverflow with details of the problem, the solution and sample code, so that someone knowledgeable can help. – Ashwin Nanjappa Mar 27 '12 at 10:31
Thanks for the reply. I hope I'll come up with the answers in the future. – Megharaj Mar 27 '12 at 10:43
Megharaj: Could you accept this answer if it satisfies your Q? (Press the tick) – Ashwin Nanjappa Mar 27 '12 at 10:55
This is wrong, wrong wrong. If private memory were generally stored far from the processors every single OpenCL kernel would run at a crawl! Local primitive variables are stored in private memory. A statement like x = ++i would involve two round trips to off-chip memory! Primitive private data will almost certainly be in registers at all times. Private arrays will almost always be in registers also ... only if you use large private arrays might you hit a problem. – barneypitt Mar 29 '18 at 11:47
