
I have some classes that derive from a managed memory allocator, so for example:

/* Managed is from https://devtalk.nvidia.com/default/topic/987577/-thrust-is-there-a-managed_vector-with-unified-memory-do-we-still-only-have-device_vector-cuda-thrust-managed-vectors-/

   It overrides the new operator, calling cudaMallocManaged and then casting. */

class Cell : public Managed {
public:
    int a; float b; char c; // say ~50 fields
};
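For reference, the Managed base class from the linked thread follows the usual unified-memory pattern: it overrides operator new/delete so that any derived object lives in managed memory. A minimal sketch (the exact member names in the linked code may differ):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Sketch of the Managed base class pattern: operator new allocates
// unified (managed) memory, so derived objects are accessible from
// both host and device code.
class Managed {
public:
    void *operator new(std::size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};
```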

Now, say I have an array of 100,000 Cell objects and want to pass it to some __global__ function that uses only a small set (say 5-10) of the fields in its computation.

The easiest way is to pass the entire array of Cell objects. It does, however, copy a lot of unused data.

A tighter approach is to allocate device arrays of only the needed 5-10 fields, copy the values in, and pass those to the __global__ function. It's a bit annoying, since whenever the function body needs some other field of the Cell class, its signature has to be rewritten to accept the new arrays.
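The tighter approach could be sketched roughly as follows (the kernel name and the choice of fields a and b are made up for illustration, and error checking is elided). Note how using one more field in the kernel body forces a change to both the gather loop and the kernel signature:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel that needs only fields a and b of each Cell.
__global__ void compute(const int *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i];
}

// Host side: gather just the needed fields into plain device arrays.
void run(const Cell *cells, int n) {
    std::vector<int>   ha(n);
    std::vector<float> hb(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = cells[i].a;
        hb[i] = cells[i].b;
    }

    int *da; float *db, *dout;
    cudaMalloc(&da,   n * sizeof(int));
    cudaMalloc(&db,   n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    compute<<<(n + 255) / 256, 256>>>(da, db, dout, n);
    // ... copy results back, cudaFree, error checks elided
}
```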

My question: in general, how bad is the performance penalty for using the easiest approach?

Thanks!

einpoklum
danwanban
  • What do you mean by "sending the entire array"? If you are using managed memory, you are not "sending" anything. The driver and device are coordinating ad-hoc access by the device across the PCI-E interface. – talonmies Dec 13 '17 at 13:25
  • "sending" - passing the array of Cell* objects to the __global__ function. I don't know (and I guess it's implementation-dependent?) what goes on under the hood. – danwanban Dec 13 '17 at 21:27
  • Again, passing a pointer to a kernel as an argument is pass by value, there is no data copying or transferring of the underlying array of structures. You are only passing a 64 bit address. – talonmies Dec 13 '17 at 21:30

1 Answer


How managed memory is handled very much depends on the compute capability of your device. Pascal (compute capability 6.x) and later devices will demand-page in only those pages that are actually accessed.

Devices of lower compute capability will normally transfer the entire set of managed memory, regardless of how much of it is accessed, or even whether it is accessed at all. However, you can explicitly declare the memory regions to transfer on a per-stream basis using cudaStreamAttachMemAsync(). This allows you to limit the amount of data transferred without having to change the allocation or data structure at all.
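On pre-Pascal devices this could look roughly like the following sketch (myKernel is hypothetical; the Cell array is assumed to come from cudaMallocManaged via the Managed base class). By attaching each managed allocation to the stream that actually uses it, launches on that stream only migrate the attached regions rather than all managed memory:

```cuda
#include <cuda_runtime.h>

// Sketch: attach a managed allocation to one stream so that, on
// pre-Pascal devices, kernel launches in that stream only migrate
// the attached region instead of all managed memory.
void launch(Cell *cells, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Attach the Cell array to this stream exclusively.
    cudaStreamAttachMemAsync(stream, cells, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);

    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(cells, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```

Passing 0 as the length attaches the entire allocation, which is what the API expects for managed memory.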

tera