
I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?

The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.

I searched the CUDA documentation, but didn't find any info related to this. Can someone help? Thanks!

user873275

2 Answers


No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.

If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
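As a minimal sketch of that pattern (the device id, array size, and stream here are just placeholders for illustration, and error checking is omitted for brevity):

```cpp
#include <cuda_runtime.h>

int main() {
    const int dstDevice = 1;      // hypothetical target GPU id
    const size_t n = 1 << 20;
    float *a = nullptr;

    cudaMallocManaged(&a, n * sizeof(float));

    // Ask the driver to migrate the pages to dstDevice ahead of the
    // kernel launch, instead of relying on on-demand page faults.
    cudaSetDevice(dstDevice);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemPrefetchAsync(a, n * sizeof(float), dstDevice, stream);

    // ... launch kernels that use `a` on dstDevice in `stream` ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(a);
    return 0;
}
```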

Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.

Robert Crovella
  • Thanks @Robert. BTW, I saw you work for NVIDIA, so I'm wondering if I can file this as a feature request: I like the convenience of `cudaMallocManaged`, but I also want to control which GPU the memory is allocated on, and this is the only manual control I need from it. Actually, the very existence of cudaMemPrefetchAsync demonstrates such a need. So instead of prefetching after allocating managed memory, why not allow the programmer to specify the GPU id directly from the very beginning? – user873275 Sep 03 '22 at 17:15
  • File such a request using [this process](https://forums.developer.nvidia.com/t/how-to-report-a-bug/67911). Managed memory is **not** allocated on a particular GPU or processor. It is a migratable allocation that can be present on any processor in the system. – Robert Crovella Sep 03 '22 at 17:20

From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):

By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.

So for any recent architecture it works like NUMA nodes on the CPU: the allocation itself says nothing about where the memory will be physically placed. That is instead decided on "first touch", i.e. at initialization. So as long as the first write to these locations comes from the GPU where you want them to be resident, you are fine.
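A minimal sketch of that first-touch behavior (the device id and size are illustrative placeholders, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

__global__ void firstTouch(float *a, size_t n) {
    // Writing from this GPU populates the physical pages here
    // (on devices of compute capability 6.x and newer).
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = 0.0f;
}

int main() {
    const int myGpu = 1;          // hypothetical target GPU id
    const size_t n = 1 << 20;
    float *a = nullptr;

    cudaMallocManaged(&a, n * sizeof(float));  // no physical pages yet

    // First touch from the desired GPU decides initial residency.
    cudaSetDevice(myGpu);
    firstTouch<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    return 0;
}
```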

Therefore I also don't think a feature request will find support: in this memory model, allocation and placement are completely independent operations.

In addition to the explicit prefetching Robert Crovella described, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (see N.3.2. Data Usage Hints).
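For example, a sketch of such hints (the two device ids are illustrative placeholders; error checking is omitted):

```cpp
#include <cuda_runtime.h>

int main() {
    const int gpuA = 0, gpuB = 1;   // hypothetical device ids
    const size_t bytes = (1 << 20) * sizeof(float);
    float *a = nullptr;

    cudaMallocManaged(&a, bytes);

    // Prefer to keep the pages resident on gpuA...
    cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, gpuA);
    // ...but let gpuB map them remotely instead of forcing a
    // migration when it occasionally accesses the data.
    cudaMemAdvise(a, bytes, cudaMemAdviseSetAccessedBy, gpuB);

    cudaFree(a);
    return 0;
}
```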

The idea behind all this is that during fast prototyping you can start off by just using cudaMallocManaged without caring about placement, etc. Later you profile your code and optimize the slow parts with hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be much easier to read or less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with, e.g., corrupted data that might go unnoticed.
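A rough sketch of that replacement, with placeholder size and device id (error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    const int dev = 0;            // hypothetical device id
    std::vector<float> host(n, 1.0f);

    // Explicit management: separate host and device copies.
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // kernel<<<grid, block>>>(d, n);
    cudaFree(d);

    // Managed equivalent: one pointer; prefetch replaces the copy.
    float *m = nullptr;
    cudaMallocManaged(&m, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) m[i] = 1.0f;  // first touch on the CPU
    cudaMemPrefetchAsync(m, n * sizeof(float), dev, 0);
    // kernel<<<grid, block>>>(m, n);
    cudaDeviceSynchronize();
    cudaFree(m);
    return 0;
}
```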

In multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

paleonix