I have only found a remark that local memory is slower than register memory; those are the two per-thread memory types.
Shared memory is supposed to be fast, but is it faster than a thread's own local memory?
What I want to do is a kind of median filter, but with a given percentile instead of the median. So I need to take chunks of the list, sort each chunk, and then pick the element at the matching rank. But I can't sort the list in place in shared memory, or things go wrong. Will I lose a lot of performance by just copying each chunk to local memory first?
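
To make the question concrete, here is a minimal sketch of the approach I mean. It is not my actual code: the names (`percentileFilter`, `percentileFilterHost`), the window size `WINDOW`, the block size `BLOCK`, the clamped borders, and the use of insertion sort are all assumptions made for illustration. Each block stages a tile of the input in shared memory, and each thread then copies its own window into a private array before sorting, so the shared tile is never modified.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int WINDOW = 7;           // hypothetical window size (odd)
constexpr int HALO   = WINDOW / 2;  // extra elements needed on each side
constexpr int BLOCK  = 128;         // hypothetical threads per block

__global__ void percentileFilter(const float* in, float* out, int n, int rank)
{
    // Tile of the input plus a halo on both sides, shared by the whole block.
    __shared__ float tile[BLOCK + 2 * HALO];

    // Cooperative load; indices are clamped at the array borders (an assumption).
    for (int t = threadIdx.x; t < BLOCK + 2 * HALO; t += BLOCK) {
        int src = blockIdx.x * BLOCK + t - HALO;
        src = max(0, min(n - 1, src));
        tile[t] = in[src];
    }
    __syncthreads();

    int gid = blockIdx.x * BLOCK + threadIdx.x;
    if (gid >= n) return;

    // Private per-thread copy of this thread's window. Whether the compiler
    // keeps it in registers or places it in local memory is its decision.
    float w[WINDOW];
    for (int j = 0; j < WINDOW; ++j) w[j] = tile[threadIdx.x + j];

    // Insertion sort on the private copy; the shared tile stays read-only.
    for (int a = 1; a < WINDOW; ++a) {
        float v = w[a];
        int b = a - 1;
        while (b >= 0 && w[b] > v) { w[b + 1] = w[b]; --b; }
        w[b + 1] = v;
    }

    out[gid] = w[rank];  // element at the requested rank of the sorted window
}

// Hypothetical host wrapper: pct in [0, 1], e.g. 0.5f gives a median filter.
std::vector<float> percentileFilterHost(const std::vector<float>& h_in, float pct)
{
    assert(pct >= 0.0f && pct <= 1.0f);
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    // Nearest-rank index into the sorted window.
    int rank = static_cast<int>(pct * (WINDOW - 1) + 0.5f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    percentileFilter<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n, rank);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `percentileFilterHost(data, 0.5f)` would give the plain median filter, and other values of `pct` select a different rank within each sorted window. The copy into `w` is the step I am asking about.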