First, I would like to confirm the following: the elementary transaction from global memory to shared memory is 32, 64, or 128 bytes, but only if the memory accesses can be coalesced. The latencies of these transactions are all equal. Is that right?
Second question: if the memory reads can't be coalesced and each thread reads only 4 bytes (is that right?), will the memory accesses of all the threads be performed sequentially?
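To make the second question concrete, here is a minimal sketch of how I currently picture it. It counts how many aligned memory segments a 32-thread warp touches for a given access stride, assuming one transaction is issued per distinct segment. That assumption is exactly the part I would like confirmed, and the names `segments_touched`, `stride_words`, and `segment_bytes` are only for illustration, not from any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def segments_touched(stride_words, word_bytes=4, segment_bytes=128, base=0):
    """Count the distinct aligned segments hit when thread i reads
    the word at index i * stride_words.

    Assumption (not confirmed): the hardware issues one transaction
    per distinct segment touched by the warp.
    """
    addresses = (base + i * stride_words * word_bytes for i in range(WARP_SIZE))
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    # Contiguous 4-byte reads: the whole warp fits in one 128-byte segment.
    print("stride 1: ", segments_touched(1))
    # Every other word: the warp spans two 128-byte segments.
    print("stride 2: ", segments_touched(2))
    # One word per 128 bytes: every thread lands in its own segment.
    print("stride 32:", segments_touched(32))
```

Under this model, the stride-32 case needs 32 separate transactions for 128 useful bytes. What I am asking is whether those transactions are really serialized in that case.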