Most parallel reduction algorithms use shared (local) memory. That is true of the examples I have seen for NVIDIA, AMD, Intel and so on.
But what if a device doesn't have shared (local) memory?
How can I do the reduction then?
If I use the same algorithm but store the temporary values in global memory, will it work correctly?
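
To make the idea concrete, here is a minimal CPU-side sketch in C++ of what I have in mind. It is only an illustration, not device code: the name `reduce_in_global_scratch` and the `scratch` buffer are made up, and the inner loop just simulates the work-items of one reduction step running one after another.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums `data` with the usual tree-reduction pattern,
// but keeps the partial sums in one plain buffer that stands in for
// global memory instead of shared (local) memory.
float reduce_in_global_scratch(const std::vector<float>& data) {
    if (data.empty()) return 0.0f;

    std::vector<float> scratch(data);  // stand-in for a global-memory buffer
    std::size_t n = scratch.size();    // number of live partial sums

    while (n > 1) {
        std::size_t half = (n + 1) / 2;  // round up so odd sizes work

        // Each iteration i plays the role of one work-item.
        // Writes go to indices below `half`, reads come from indices at or
        // above `half`, so the work-items of one step do not touch the
        // same element.
        for (std::size_t i = 0; i + half < n; ++i) {
            scratch[i] += scratch[i + half];
        }

        // Assumption: on a real device, all work-items must finish this
        // step before the next one starts (a barrier plus memory fence,
        // or a separate kernel launch per step).
        n = half;
    }
    return scratch[0];
}
```

As far as I understand, the part I am unsure about is the synchronization between steps: a barrier normally only synchronizes work-items within one work-group, so this pattern would presumably have to stay inside a single work-group or use one kernel launch per step.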