James Beilby's answer is the right direction but is a little bit out of the path:
Depending on implementation, it could be faster or slower because opencl doesn't force providers to use on-chip or off-chip memories but AMD is very good at OpenCL on price/performance dimension so I'll give some numbers about it.
Private memory in AMD implementation, is fastest(smallest latency,highest bandwidth like 22 TB/s for a mainstream gpu).
Here in appendix-d:
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

you can see register file, LDS, constant cache and global those are used for different name spaces when there is enough space for themselves. For example, register file has 22 TB/s and only about 300kB per compute unit. This has less latency and more bandwidth than LDS which is used for __local memory space. Total LDS size is even less than that (per compute unit).
If going from local to private doesnt do good, you should decrease local thread group size from 256 to 64 for example. SO more private registers availeable per thread.
So for this example AMD gpu, local memory is 15 times faster than global memory, private memory is 5 times faster than local memory. If it doesn't fit in private memory, it spills to global memory so only L1-L2 cache can help here. If data is not re-used much, no point of using private registers here. Just stream from global to global if only used once.
For some smartphone or a cpu, it could be very bad to use private registers because they could be mapped to something else.