It seems that most tutorials, guides, books and Q&A on the web refer to CUDA 3 and 4.x, which is why I'm asking specifically about CUDA 5.0. On to the question...
I would like to program for an environment with two CUDA devices but use only one thread, to keep the design simple (especially because it is a prototype). I want to know if the following code is valid:
float *x[2];
float *dev_x[2];
for(int d = 0; d < 2; d++) {
    cudaSetDevice(d);
    cudaMalloc(&dev_x[d], 1024);
}
for(int repeats = 0; repeats < 100; repeats++) {
    for(int d = 0; d < 2; d++) {
        cudaSetDevice(d);
        cudaMemcpy(dev_x[d], x[d], 1024, cudaMemcpyHostToDevice);
        some_kernel<<<...>>>(dev_x[d]);
        cudaMemcpy(x[d], dev_x[d], 1024, cudaMemcpyDeviceToHost);
    }
    cudaStreamSynchronize(0);
}
Specifically, I would like to know whether the allocations made by cudaMalloc(...) before the test loop persist even though cudaSetDevice() is switched back and forth within the same thread. I would also like to know whether the same holds for context-dependent objects such as cudaEvent_t and cudaStream_t.
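To make the stream and event part of the question concrete, here is a minimal sketch of the single-threaded pattern being asked about. It assumes that each stream and event is created while its device is current and is only used with work for that same device, which is how the CUDA programming guide describes multi-device usage. The names `scale_kernel`, `run_per_device_streams`, `host_x`, `stream` and `done` are hypothetical and not taken from the original application, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the original some_kernel is not shown.
__global__ void scale_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of devices used, or -1 if a result did not match.
int run_per_device_streams() {
    const int kMaxDevices = 2;
    const int n = 256;                        // 256 floats = 1024 bytes, as in the question
    const size_t bytes = n * sizeof(float);

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 0;
    const int num_devices = count < kMaxDevices ? count : kMaxDevices;

    float *host_x[kMaxDevices] = {};
    float *dev_x[kMaxDevices] = {};
    cudaStream_t stream[kMaxDevices];
    cudaEvent_t done[kMaxDevices];

    // Create every per-device resource while that device is current.
    for (int d = 0; d < num_devices; d++) {
        cudaSetDevice(d);
        cudaMallocHost((void **)&host_x[d], bytes);  // pinned memory for async copies
        cudaMalloc((void **)&dev_x[d], bytes);
        cudaStreamCreate(&stream[d]);
        cudaEventCreate(&done[d]);
        for (int i = 0; i < n; i++) host_x[d][i] = 1.0f;
    }

    // Enqueue copy-in, kernel and copy-out on each device without blocking the host.
    for (int d = 0; d < num_devices; d++) {
        cudaSetDevice(d);
        cudaMemcpyAsync(dev_x[d], host_x[d], bytes, cudaMemcpyHostToDevice, stream[d]);
        scale_kernel<<<(n + 127) / 128, 128, 0, stream[d]>>>(dev_x[d], n);
        cudaMemcpyAsync(host_x[d], dev_x[d], bytes, cudaMemcpyDeviceToHost, stream[d]);
        cudaEventRecord(done[d], stream[d]);
    }

    // Wait for each device separately, then verify and release its resources.
    bool ok = true;
    for (int d = 0; d < num_devices; d++) {
        cudaSetDevice(d);
        cudaEventSynchronize(done[d]);
        for (int i = 0; i < n; i++) ok = ok && (host_x[d][i] == 2.0f);
        cudaEventDestroy(done[d]);
        cudaStreamDestroy(stream[d]);
        cudaFree(dev_x[d]);
        cudaFreeHost(host_x[d]);
    }
    return ok ? num_devices : -1;
}

int main() {
    std::printf("result: %d\n", run_per_device_streams());
    return 0;
}
```

The sketch waits on one event per device rather than on a single call after the inner loop, so completion on each device is checked explicitly.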
I am asking because I have an application written in this style that keeps getting a mapping error, and I can't tell whether the cause is a memory leak I have missed or incorrect API usage.
Note: In my original code, I do check every single CUDA call. I omitted the checks here for readability.
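For reference, per-call checking of the kind mentioned in the note is often written as a small wrapper macro. The following is only a sketch of that idea: `CUDA_CHECK` and `empty_kernel` are hypothetical names, not the code from the original application. Because a kernel launch returns no status itself, the sketch queries `cudaGetLastError()` right after the launch and then synchronizes to surface errors raised while the kernel ran.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file, line and message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,  \
                         __LINE__, #call, cudaGetErrorString(err_));  \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical placeholder kernel that does nothing.
__global__ void empty_kernel() {}

int main() {
    int count = 0;
    // Exits with an error message if no usable device is present.
    CUDA_CHECK(cudaGetDeviceCount(&count));

    for (int d = 0; d < count; d++) {
        CUDA_CHECK(cudaSetDevice(d));
        empty_kernel<<<1, 1>>>();
        CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
        CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    }

    std::printf("checked %d device(s)\n", count);
    return 0;
}
```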