I found this parallel reduction code from Stanford that uses shared memory.
The example runs on 1<<18 (262144) elements and produces correct results.
Why do I get correct results for certain numbers of elements, while for other numbers, such as 200000 or 25000, I get different, unexpected results?
It looks to me as if it always launches the required number of thread blocks.
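For reference, here is a minimal sketch of the kind of shared-memory tree reduction I mean. This is my own reconstruction of the pattern, not the exact code from the course: the kernel name, the block size of 512, and doing the second pass over the per-block sums on the host are all my choices for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces its slice of the input to one partial sum in shared memory.
__global__ void block_sum(const float *input, float *per_block_results, int n)
{
    extern __shared__ float sdata[];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Zero-fill threads that fall past the end of the array so a partial
    // last block does not reduce uninitialized data.
    sdata[threadIdx.x] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        per_block_results[blockIdx.x] = sdata[0];
}

int main()
{
    const int n = 200000;              // one of the sizes that misbehaves for me
    const int block_size = 512;
    // Round up so the trailing, partially filled block is still launched.
    const int num_blocks = (n + block_size - 1) / block_size;

    std::vector<float> h_in(n, 1.0f);  // expected sum: n

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, num_blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<num_blocks, block_size, block_size * sizeof(float)>>>(d_in, d_partial, n);

    // Second pass done on the host here, just to keep the sketch short.
    std::vector<float> h_partial(num_blocks);
    cudaMemcpy(h_partial.data(), d_partial, num_blocks * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (float p : h_partial) sum += p;
    printf("sum = %f (expected %d)\n", sum, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```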