In the NVIDIA developer blog post "An Even Easier Introduction to CUDA", the writer explains:
> To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call `cudaMallocManaged()`, which returns a pointer that you can access from host (CPU) code or device (GPU) code.
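The blog's own example (condensed here from memory, so treat it as a sketch rather than a verbatim copy) allocates with `cudaMallocManaged()`, writes through the pointer from the CPU, and then launches a kernel on the very same pointer:

```cpp
#include <iostream>

// Kernel that adds two arrays element-wise (single thread, for simplicity).
__global__ void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main() {
  int N = 1 << 20;
  float *x, *y;

  // One allocation, one pointer -- usable from both host and device.
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // Host (CPU) code writes through the pointers directly.
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Device (GPU) code reads and writes the same pointers.
  add<<<1, 1>>>(N, x, y);
  cudaDeviceSynchronize();  // wait before the CPU touches the data again

  std::cout << "y[0] = " << y[0] << std::endl;  // expect 3.0

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```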
I found this both interesting (it seems convenient) and confusing, particularly this part:

> returns a pointer that you can access from host (CPU) code or device (GPU) code.
For this to be true, it seems like `cudaMallocManaged()` must be syncing two buffers across VRAM and RAM. Is that the case, or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned there is a distinct performance difference between two patterns: passing VRAM-based buffers (textures, in WebGL) from kernel to kernel, which keeps the buffer on the GPU and is highly performant, versus retrieving the buffer's value outside the kernels to access it in RAM through JavaScript, which pulls the buffer off the GPU and takes a performance hit, since buffers in VRAM don't magically move to RAM.
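In CUDA terms, I assume the explicit (non-unified) equivalent of that round trip looks something like the sketch below, with `cudaMalloc()` for device memory and `cudaMemcpy()` for the transfers. The shape of this code is my own assumption from the documentation, not code from the blog:

```cpp
#include <cstdio>

// Doubles every element of a device array.
__global__ void scale(int n, float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int N = 1 << 20;
  float *h_buf = new float[N];           // host (RAM) buffer
  for (int i = 0; i < N; i++) h_buf[i] = 1.0f;

  float *d_buf;
  cudaMalloc(&d_buf, N * sizeof(float)); // device (VRAM) buffer

  // Explicit upload: RAM -> VRAM.
  cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);

  // Kernel-to-kernel chaining: the data stays in VRAM between launches,
  // which is the cheap path (like passing textures between GPU.js kernels).
  scale<<<(N + 255) / 256, 256>>>(N, d_buf);
  scale<<<(N + 255) / 256, 256>>>(N, d_buf);

  // Explicit download: VRAM -> RAM. This is the expensive "pull it off
  // the GPU" step, analogous to reading a texture back into JavaScript.
  cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

  printf("h_buf[0] = %f\n", h_buf[0]);   // expect 4.0

  cudaFree(d_buf);
  delete[] h_buf;
  return 0;
}
```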
Forgive my highly abstracted understanding and description of the topic; I know most CUDA/C++ devs have a much more granular understanding of the process.
- So is `cudaMallocManaged()` creating synchronized buffers in both RAM and VRAM for the developer's convenience?
- If so, wouldn't that come with an unnecessary cost in cases where we never touch the buffer from the CPU? (A concrete sketch of that case follows this list.)
- Does the compiler perhaps check whether we ever reference the buffer from CPU code, and skip creating the CPU side of the synced buffer if it's not needed?
- Or do I have it all wrong? Are we not even talking about VRAM here? How does this work?
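To make that middle question concrete, here is the case I'm wondering about: a managed allocation that only a kernel ever touches. (This is my own illustrative sketch, not code from the blog.)

```cpp
// Fills a device array with a constant value.
__global__ void fill(int n, float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 42.0f;
}

int main() {
  const int N = 1 << 20;
  float *buf;
  cudaMallocManaged(&buf, N * sizeof(float));

  // The CPU never reads or writes buf; only the kernel does.
  // Does a RAM-side copy still get created and synced in this case?
  fill<<<(N + 255) / 256, 256>>>(N, buf);
  cudaDeviceSynchronize();

  cudaFree(buf);
  return 0;
}
```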