I wrote a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it against cudaMemcpy(). It looks like this:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
    // This copy kernel is only correct when size % sizeof(vector_type) == 0.
    auto numElements = size / sizeof(vector_type);
    // Grid-stride loop: thread id copies elements id, id + stride, id + 2*stride, ...
    for (auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements; id += gridDim.x * THREADS_PER_BLOCK) {
        dest2[id] = src2[id];
    }
}
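For reference, a minimal launch-and-timing harness around this kernel could look like the sketch below. The buffer setup, event-based timing, and main() scaffolding here are illustrative assumptions, not the exact benchmark code:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // 400 MB; divisible by sizeof(int2), as the kernel requires.
    const unsigned int size = 400u * 1000u * 1000u;
    void *src, *dest;
    cudaMalloc(&src, size);
    cudaMalloc(&dest, size);

    constexpr unsigned int THREADS_PER_BLOCK = 256;
    const unsigned int numElements = size / sizeof(int2);
    // One element per thread: ceil(50e6 / 256) = 195313 blocks.
    const unsigned int NUM_BLOCKS = (numElements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<THREADS_PER_BLOCK><<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(src, dest, size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each byte is read once and written once, hence the factor of 2.
    printf("my memcpy finished in %fms. Bandwidth: %f GB/s\n", ms, 2.0 * size / (ms * 1e6));

    cudaFree(src);
    cudaFree(dest);
    return 0;
}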
I also calculated the number of blocks required to reach 100% occupancy like so:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multiprocessor: 2048
NUM_BLOCKS = 4 * 2048 / 256 = 32
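The same number can also be derived at runtime instead of hard-coding the device limits, e.g. with the occupancy API. This is only a sketch; it assumes nothing else (such as register usage) limits the occupancy of this kernel:

int numSMs = 0, blocksPerSM = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
// Resident blocks of copy<256> per SM; 2048 / 256 = 8 if nothing else limits it.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copy<256>, 256, 0);
const int NUM_BLOCKS = numSMs * blocksPerSM;  // 4 * 8 = 32 on the GTX 760M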
My tests, on the other hand, showed that launching enough blocks so that each thread processes exactly one element consistently outperformed the "optimal" block count. Here are the timings for 400 MB of data:
bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
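(The bandwidth figures count each copied byte twice, one read plus one write: 2 × 400 MB / 15.63 ms ≈ 51.2 GB/s.)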
So my questions are:
Why is there a speed difference?
Are there any downsides to launching one thread per element, when each element can be processed completely independently of all the others?