
I keep getting "invalid device ordinal" when trying to set the preferred location of managed memory to GPU #0:

CUDA_ERR_CHECK(cudaMemAdvise(deviceMemoryHeap.pool, size,
    cudaMemAdviseSetPreferredLocation, 0));

The only thing that works is cudaCpuDeviceId. So, how to specify the GPU id?
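For contrast, the variant that does work here is the CPU-device form of the same call (a restatement of the working case, using the same allocation):

// Works: advise the runtime to prefer keeping the pages in host memory.
CUDA_ERR_CHECK(cudaMemAdvise(deviceMemoryHeap.pool, size,
    cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));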

EDIT: Adding a simple example:

#include <cstdio>
#include <cstdlib>

#define CUDA_ERR_CHECK(x)                                  \
    do { cudaError_t err = x; if (err != cudaSuccess) {    \
        fprintf(stderr, "CUDA error %d \"%s\" at %s:%d\n", \
        (int)err, cudaGetErrorString(err),                 \
        __FILE__, __LINE__);                               \
        exit(1);                                           \
    }} while (0)

template<typename T>
__global__ void kernel(size_t* value)
{
    *value = sizeof(T);
}

int main()
{
    size_t size = 1024 * 1024 * 1024;

    size_t* managed = NULL;
    CUDA_ERR_CHECK(cudaMallocManaged(&managed, size, cudaMemAttachGlobal));
    CUDA_ERR_CHECK(cudaMemAdvise(managed, size,
        cudaMemAdviseSetPreferredLocation, 0));
    kernel<double><<<1, 1>>>(managed);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(managed));
    size_t* memory = NULL;
    CUDA_ERR_CHECK(cudaMalloc(&memory, size));
    kernel<double><<<1, 1>>>(memory);
    CUDA_ERR_CHECK(cudaGetLastError());
    CUDA_ERR_CHECK(cudaDeviceSynchronize());
    CUDA_ERR_CHECK(cudaFree(memory));

    return 0;
}

Throws an error:

$ make
nvcc -arch=sm_30 managed.cu -o managed
$ ./managed 
CUDA error 10 "invalid device ordinal" at managed.cu:24

CUDA 8.0

My goal is to get rid of the giant cudaLaunch call latency, which occurs only when the kernel is launched with an argument in managed memory:

[Profiler screenshot: cudaLaunch latency when the kernel argument is in managed memory]
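For reference, the latency is visible even with plain wall-clock timing around the launch call alone. Below is a minimal sketch (the `touch` kernel is a hypothetical stand-in; compile with `nvcc -std=c++11`); on a pre-Pascal device the launch itself blocks while the managed allocation migrates:

#include <chrono>
#include <cstdio>

__global__ void touch(size_t* value) { *value = 42; }

int main()
{
    size_t size = 1024 * 1024 * 1024;
    size_t* managed = NULL;
    cudaMallocManaged(&managed, size, cudaMemAttachGlobal);

    // Time only the launch call: on pre-Pascal devices the runtime
    // migrates the whole managed allocation to the GPU here.
    auto t0 = std::chrono::high_resolution_clock::now();
    touch<<<1, 1>>>(managed);
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();

    printf("launch call took %f s\n",
        std::chrono::duration<double>(t1 - t0).count());

    cudaFree(managed);
    return 0;
}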

Dmitry Mikushin
  • Do you have a multi-GPU system? If so, are the GPUs in a topology that supports peer-to-peer access? For questions like this ("why isn't this code working?") you are supposed to provide an [mcve]. In this case, we need to know about the system you are running on. Unified memory in a multi-GPU system where the GPUs are not in a full peer-to-peer mesh will allocate managed memory requests using zero-copy memory, i.e. host memory (refer to the programming guide). Such a scenario is workable, but the memory/data cannot migrate, so choosing a GPU device ID in that case would be invalid. – Robert Crovella Dec 09 '16 at 14:16
  • As an example, I created a test case for a 4 GPU system. I have one such system where the GPUs are in a P2P capable mesh. The full session output is [here](http://pastebin.com/mUnRDAAA) (no errors). I have another such system where the 4 GPUs are not all P2P capable. The full session output for that is [here](http://pastebin.com/Nspz0qg3) (showing invalid device ordinal errors). I suspect this is what is happening in your case. The programming guide for UM [documents the use of zero-copy memory](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory) in this case. – Robert Crovella Dec 09 '16 at 14:25
  • The behavior may also vary depending on device type and CUDA version, but since you are using `cudaMemAdvise` the CUDA version must be 8.0 (at this time). – Robert Crovella Dec 09 '16 at 14:26
  • Added a simple test case – Dmitry Mikushin Dec 09 '16 at 16:19
  • So, `cudaMemAdviseSetPreferredLocation` only makes sense for multi-GPU setups. Can we still, by any means, use GPU #0 as the primary managed memory location on a single-GPU system? – Dmitry Mikushin Dec 09 '16 at 16:25
  • I didn't say that. It could be useful in a single-GPU Pascal setup. The `cudaMemAdvise` API really applies mainly to Pascal devices that can do demand paging, and where the runtime can do demand paging and make runtime decisions about where a memory page should reside or start. In the case of your cc3.0 Kepler device, this isn't possible. One of the reasons for the long `cudaLaunch` call latency in your example is the large allocation that has to be transferred from host to device at kernel launch. I don't know of a way to get around this for UM on pre-Pascal devices. – Robert Crovella Dec 09 '16 at 16:40
  • In your case, movement of the 1GB allocation will take approximately 0.16s in your PCIE Gen2 setup. If you profile with UM enabled, you should be able to see the transfer. If you reduce `size` to e.g. 1Kbyte, then I think you'll see the overall launch latency reduce. – Robert Crovella Dec 09 '16 at 16:43
  • To be clear, this particular test case you have shown here could look very much different on a Pascal GPU. – Robert Crovella Dec 09 '16 at 16:59
  • OK, I see, so there is little chance to prefetch data (in a separate transfer, apart from cudaLaunch, anyway) to GPU memory on pre-Pascal devices. – Dmitry Mikushin Dec 09 '16 at 16:59
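A quick way to check the conditions discussed in the comments above is to query cudaDevAttrConcurrentManagedAccess for every device and cudaDeviceCanAccessPeer for every pair. The sketch below is illustrative, not from the original thread; as a sanity check on the 0.16s figure, 1 GiB at the roughly 6 GB/s achievable over PCIe Gen2 x16 is indeed about 0.17 s.

#include <cstdio>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        // Non-zero means cudaMemAdvise accepts this GPU id as a location.
        int concurrent = 0;
        cudaDeviceGetAttribute(&concurrent,
            cudaDevAttrConcurrentManagedAccess, i);
        printf("GPU %d: concurrent managed access = %d\n", i, concurrent);
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int p2p = 0;
            cudaDeviceCanAccessPeer(&p2p, i, j);
            printf("  GPU %d -> GPU %d: peer access = %d\n", i, j, p2p);
        }
    }
    return 0;
}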

1 Answer


The error seems to originate from a missing device capability. As the CUDA documentation for the cudaMemAdvise function states:

If device is a GPU, then it must have a non-zero value for the device attribute cudaDevAttrConcurrentManagedAccess.

You should run the following check to make sure that the device supports concurrent managed access:

int device_id = 0, result = 0;
cudaDeviceGetAttribute(&result, cudaDevAttrConcurrentManagedAccess, device_id);
if (result) {
    // Call cudaMemAdvise
}
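For completeness, the guarded call could be combined with cudaMemPrefetchAsync (also introduced in CUDA 8.0) to migrate the pages ahead of the launch, which addresses the latency concern in the question. This is a sketch assuming the `managed`/`size` variables and the CUDA_ERR_CHECK macro from the question:

int device_id = 0, result = 0;
CUDA_ERR_CHECK(cudaDeviceGetAttribute(&result,
    cudaDevAttrConcurrentManagedAccess, device_id));
if (result) {
    // Prefer keeping the pages resident on this GPU.
    CUDA_ERR_CHECK(cudaMemAdvise(managed, size,
        cudaMemAdviseSetPreferredLocation, device_id));
    // Migrate the pages now, so the kernel launch does not pay for it.
    CUDA_ERR_CHECK(cudaMemPrefetchAsync(managed, size, device_id, 0));
}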
Tal Ben-Nun