
It seems that most tutorials, guides, books and Q&A on the web refer to CUDA 3 and 4.x, which is why I'm asking specifically about CUDA 5.0. On to the question...

I would like to program for an environment with two CUDA devices but use only one host thread, to keep the design simple (especially because it is a prototype). I want to know if the following code is valid:

float *x[2];
float *dev_x[2];

for(int d = 0; d < 2; d++) {
    cudaSetDevice(d);
    cudaMalloc(&dev_x[d], 1024);
}

for(int repeats = 0; repeats < 100; repeats++) {
    for(int d = 0; d < 2; d++) {
        cudaSetDevice(d);
        cudaMemcpy(dev_x[d],x[d],1024,cudaMemcpyHostToDevice);

        some_kernel<<<...>>>(dev_x[d]);

        cudaMemcpy(x[d],dev_x[d],1024,cudaMemcpyDeviceToHost);
    }
    cudaStreamSynchronize(0);
}

Specifically, I would like to know whether the allocations made by cudaMalloc(...) before the test loop persist even though cudaSetDevice() keeps switching devices within the same thread. I would also like to know whether the same holds for context-dependent objects such as cudaEvent_t and cudaStream_t.

I am asking because I have an application written in this style that keeps getting a mapping error, and I can't tell whether the cause is a memory leak I have missed or incorrect API usage.

Note: In my original code, I do check every single CUDA call. I left the checks out here for readability.

einpoklum
  • You don't have to call cudaStreamSynchronize() because the cudaMemcpy() calls are synchronous. Also, note that after your loop terminates, device 1 will be current to the CPU thread. – ArchaeaSoftware Feb 13 '13 at 14:55
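The second point in the comment above (the last device selected stays current for the thread) can be checked with a minimal sketch like the one below. The function name `last_current_device` is hypothetical and only used for illustration; the runtime calls are `cudaGetDeviceCount`, `cudaSetDevice` and `cudaGetDevice`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: visits every device in order and reports which one is
// current afterwards. Returns -1 if any runtime call fails or no device exists.
int last_current_device() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    for (int d = 0; d < count; d++) {
        if (cudaSetDevice(d) != cudaSuccess) return -1;
    }

    int current = -1;
    if (cudaGetDevice(&current) != cudaSuccess) return -1;
    return current;  // the last device passed to cudaSetDevice() stays current
}

int main() {
    printf("current device after the loop: %d\n", last_current_device());
    return 0;
}
```

On a two-device machine this should report device 1, so any later call that is meant for device 0 needs its own `cudaSetDevice(0)` first.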

1 Answer


Is this just a typo?

for(int d = 0; d < 2; d++) {
    cudaSetDevice(0);  // shouldn't that be 'd'
    cudaMalloc(&dev_x, 1024);
}

Please check the return value of all API calls!
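One common way to follow this advice without cluttering the code is a small checking macro. The sketch below is only an illustration: `CUDA_CHECK` is a hypothetical name, not something provided by the CUDA runtime, and aborting on the first error is an assumption that suits a prototype rather than production code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA runtime): prints the file,
// line, failing call and error string, then exits, whenever a runtime call
// returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__,  \
                    #call, cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

int main() {
    int count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&count));

    // Allocate and free 1024 bytes on each device, checking every call.
    for (int d = 0; d < count; d++) {
        float *dev_x = NULL;
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaMalloc(&dev_x, 1024));
        CUDA_CHECK(cudaFree(dev_x));
    }

    printf("all calls succeeded on %d device(s)\n", count);
    return 0;
}
```

A kernel launch does not return a status itself, so the same macro can wrap `cudaGetLastError()` immediately after the launch.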

Tom
  • Yes, it should be OK. Regions allocated by cudaMalloc, streams, and events are all specific to the device they were created on (the device selected by the most recent `cudaSetDevice()` call). So you should be sure to use only the items that belong to the device you are currently accessing. Additional info [here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multi-device-system). Additionally, `x[d]` lives on the host, so it does not need to be indexed per device unless you want it to be. – Robert Crovella Feb 12 '13 at 18:43
  • Does the same apply to objects that are context-dependent, such as streams and events? – Ricardo Inacio Feb 12 '13 at 19:19
  • If it's not a typo and all API calls are returning success, then the information in your question does not appear sufficient to identify your problem since it looks ok (albeit with unnecessary `cudaStreamSynchronize()`). Can you try to create a repro? You could also try running with cuda-memcheck to look for OOB errors and leaks (with leakcheck option). – Tom Feb 13 '13 at 15:27
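The per-device ownership described in the comments can be sketched as follows: one host thread keeps a buffer, a stream and an event for each device, and always switches to the owning device before touching them. This is an illustration under stated assumptions, not the asker's code: `scale_kernel` is a hypothetical stand-in for `some_kernel`, `run_on_all_devices`, `CHECK` and `MAX_DEV` are made-up names, and the host buffers are pinned with `cudaMallocHost` so the asynchronous copies can overlap across devices. For brevity the sketch does not release resources on the error path.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256      // 256 floats = 1024 bytes, the size used in the question
#define MAX_DEV 8  // assumption: at most 8 devices are used by this sketch
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical stand-in for some_kernel from the question.
__global__ void scale_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Returns 0 on success, 1 on any CUDA error or unexpected result.
int run_on_all_devices() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count == 0) return 1;
    if (count > MAX_DEV) count = MAX_DEV;

    float *x[MAX_DEV], *dev_x[MAX_DEV];
    cudaStream_t stream[MAX_DEV];
    cudaEvent_t done[MAX_DEV];

    // Create every resource while its owning device is current.
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMallocHost(&x[d], N * sizeof(float)));  // pinned host buffer
        CHECK(cudaMalloc(&dev_x[d], N * sizeof(float)));
        CHECK(cudaStreamCreate(&stream[d]));
        CHECK(cudaEventCreate(&done[d]));
        for (int i = 0; i < N; i++) x[d][i] = 1.0f;
    }

    // Queue work on each device; none of these calls blocks the host thread.
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMemcpyAsync(dev_x[d], x[d], N * sizeof(float),
                              cudaMemcpyHostToDevice, stream[d]));
        scale_kernel<<<(N + 127) / 128, 128, 0, stream[d]>>>(dev_x[d], N);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(x[d], dev_x[d], N * sizeof(float),
                              cudaMemcpyDeviceToHost, stream[d]));
        CHECK(cudaEventRecord(done[d], stream[d]));
    }

    // Wait for each device, verify the result, and release its resources.
    int status = 0;
    for (int d = 0; d < count; d++) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaEventSynchronize(done[d]));
        for (int i = 0; i < N; i++) {
            if (x[d][i] != 2.0f) status = 1;
        }
        CHECK(cudaEventDestroy(done[d]));
        CHECK(cudaStreamDestroy(stream[d]));
        CHECK(cudaFree(dev_x[d]));
        CHECK(cudaFreeHost(x[d]));
    }
    return status;
}

int main() {
    printf("%s\n", run_on_all_devices() == 0 ? "ok" : "failed");
    return 0;
}
```

The allocations, streams and events made in the first loop remain valid across the later `cudaSetDevice()` switches; what matters is that each one is used only while the device it was created on is current.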