I have some classes that derive from a managed-memory base class, for example:
```cpp
/* Managed is from https://devtalk.nvidia.com/default/topic/987577/-thrust-is-there-a-managed_vector-with-unified-memory-do-we-still-only-have-device_vector-cuda-thrust-managed-vectors-/
   It overrides operator new, calling cudaMallocManaged and then casting the result. */
class Cell : public Managed {
public:
  int a; float b; char c; // say ~50 fields
};
```
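The linked post defines `Managed`. The sketch below only assumes what such a base class usually looks like (the widely copied pattern from NVIDIA's Unified Memory blog examples), plus a hypothetical `isManaged` helper for checking where an allocation actually landed. One thing worth checking in the real class: if it overrides only `operator new`, then `new Cell[100000]` goes through the global `operator new[]`, and the array is not in managed memory at all.

```cuda
#include <cassert>
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Sketch of a Managed base class (an assumption; the linked post's code may differ).
// Class-specific allocation functions route `new` to unified (managed) memory.
class Managed {
public:
  static void *operator new(std::size_t len) {
    void *ptr = nullptr;
    if (cudaMallocManaged(&ptr, len) != cudaSuccess) throw std::bad_alloc();
    cudaDeviceSynchronize();  // wait for outstanding GPU work before the host uses it
    return ptr;
  }
  // Without the array forms, `new Cell[n]` would use the global operator new[]
  // and the array would live in ordinary host memory instead.
  static void *operator new[](std::size_t len) { return Managed::operator new(len); }

  static void operator delete(void *ptr) {
    cudaDeviceSynchronize();  // a kernel may still be reading the object
    cudaFree(ptr);
  }
  static void operator delete[](void *ptr) { Managed::operator delete(ptr); }
};

// Hypothetical stand-in for the real Cell; members are public so kernels can read them.
class Cell : public Managed {
public:
  int a; float b; char c;  // ... ~50 fields in the real class
};

// Hypothetical helper: returns true if p points into managed memory.
bool isManaged(const void *p) {
  assert(p != nullptr);
  cudaPointerAttributes attr;
  if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) return false;
  return attr.type == cudaMemoryTypeManaged;
}
```

With both the scalar and array forms in place, `new Cell` and `new Cell[n]` should both report `isManaged(...) == true`.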
Now, say I have an array of 100,000 Cell objects and want to pass it to a __global__ function (kernel) that uses only a small subset (say 5-10) of the fields in its computation.
The easiest way is to pass the entire array of Cell objects. However, that moves a lot of unused data.
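A minimal sketch of the easiest approach, assuming a hypothetical stand-in `Cell` padded to 200 bytes with an unused `rest` array, and a hypothetical `runAoS` launch helper. The kernel reads only `a` and `b`, but since every `Cell` is touched, every page of the managed array has to reach the GPU, and neighboring threads read addresses `sizeof(Cell)` bytes apart, so the loads are strided rather than coalesced.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real Cell: `rest` pads it to 200 bytes,
// playing the role of the ~47 fields this kernel never reads.
struct Cell {
  int a; float b; char c;
  float rest[47];
};

// Easiest approach: the kernel receives the whole array of Cell objects.
// Thread i reads only cells[i].a and cells[i].b, but neighboring threads
// read addresses sizeof(Cell) bytes apart, so the loads are strided
// rather than coalesced and most of each memory transaction goes unused.
__global__ void computeAoS(const Cell *cells, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = cells[i].a * cells[i].b;
}

// Launches computeAoS and waits for it; returns the first CUDA error, if any.
cudaError_t runAoS(const Cell *cells, float *out, int n) {
  assert(n > 0);
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  computeAoS<<<blocks, threads>>>(cells, out, n);
  cudaError_t err = cudaGetLastError();
  return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```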
A tighter approach is to allocate device arrays for only the 5-10 needed fields, copy the values into them, and pass those arrays to the kernel. This is somewhat annoying: if the kernel body later needs other fields from the Cell class, its signature has to be rewritten to accept the new arrays.
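A sketch of the tighter approach under the same assumptions (hypothetical stand-in `Cell`; `CellFields`, `gatherFields` and `runSoA` are illustrative names, not an existing API). Grouping the field pointers into one struct passed by value is one way to ease the signature problem: the kernel parameter list stays the same when a field is added.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real ~50-field Cell (ordinary host memory is enough here).
struct Cell {
  int a; float b; char c;
  float rest[47];
};

// Hypothetical bundle of pointers to just the fields the kernel needs. Passing it
// by value keeps the kernel signature fixed: needing another field means adding a
// member here and a line in gatherFields, not a new kernel parameter.
struct CellFields {
  int *a;
  float *b;
};

// Copies the needed fields out of the Cell array into separate managed arrays.
cudaError_t gatherFields(const Cell *cells, int n, CellFields *f) {
  assert(cells != nullptr && f != nullptr && n > 0);
  cudaError_t err = cudaMallocManaged(&f->a, n * sizeof(int));
  if (err != cudaSuccess) return err;
  err = cudaMallocManaged(&f->b, n * sizeof(float));
  if (err != cudaSuccess) {
    cudaFree(f->a);
    return err;
  }
  for (int i = 0; i < n; ++i) {  // one host-side pass over all Cells
    f->a[i] = cells[i].a;
    f->b[i] = cells[i].b;
  }
  return cudaSuccess;
}

void freeFields(CellFields f) {
  cudaFree(f.a);
  cudaFree(f.b);
}

// Thread i reads a[i] and b[i]: neighboring threads touch neighboring
// addresses, so the loads coalesce and no unused fields are transferred.
__global__ void computeSoA(CellFields f, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = f.a[i] * f.b[i];
}

// Launches computeSoA and waits for it; returns the first CUDA error, if any.
cudaError_t runSoA(CellFields f, float *out, int n) {
  assert(n > 0);
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  computeSoA<<<blocks, threads>>>(f, out, n);
  cudaError_t err = cudaGetLastError();
  return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Note that `gatherFields` makes one host-side pass over every `Cell`, so its cost should be counted when comparing the two approaches, especially if the arrays are rebuilt before every launch.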
My question: in general, how large is the performance penalty of the easiest approach?
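One way to get a concrete number for a given GPU is to measure it. The harness below is a sketch (hypothetical `compareLayouts`, `timeMs` and `Timings` names; the same 200-byte stand-in `Cell`) that times both layouts with CUDA events. The split between the "first" and "warm" timings assumes a GPU and OS with on-demand page migration (Pascal or newer on Linux); on older setups, managed memory is migrated in bulk at kernel launch.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical 200-byte stand-in for the real Cell.
struct Cell {
  int a; float b; char c;
  float rest[47];
};

// Aborts with a message if a CUDA call failed.
static void check(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

__global__ void computeAoS(const Cell *cells, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = cells[i].a * cells[i].b;
}

__global__ void computeSoA(const int *a, const float *b, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a[i] * b[i];
}

// Runs launch() between two CUDA events and returns the elapsed milliseconds.
template <typename Launch>
float timeMs(Launch launch) {
  cudaEvent_t start, stop;
  check(cudaEventCreate(&start));
  check(cudaEventCreate(&stop));
  check(cudaEventRecord(start));
  launch();
  check(cudaGetLastError());
  check(cudaEventRecord(stop));
  check(cudaEventSynchronize(stop));
  float ms = 0.0f;
  check(cudaEventElapsedTime(&ms, start, stop));
  check(cudaEventDestroy(start));
  check(cudaEventDestroy(stop));
  return ms;
}

struct Timings { float aosFirst, aosWarm, soaFirst, soaWarm; };

// Times each layout twice. With on-demand page migration, the first launch
// also pays for moving managed pages to the GPU; the second mostly measures
// the kernel's own memory traffic.
Timings compareLayouts(int n) {
  assert(n > 0);
  Cell *cells; int *a; float *b; float *out;
  check(cudaMallocManaged(&cells, n * sizeof(Cell)));
  check(cudaMallocManaged(&a, n * sizeof(int)));
  check(cudaMallocManaged(&b, n * sizeof(float)));
  check(cudaMallocManaged(&out, n * sizeof(float)));
  for (int i = 0; i < n; ++i) {
    cells[i].a = a[i] = i;
    cells[i].b = b[i] = 0.5f;
  }
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  auto aos = [&] { computeAoS<<<blocks, threads>>>(cells, out, n); };
  auto soa = [&] { computeSoA<<<blocks, threads>>>(a, b, out, n); };
  Timings t;
  t.aosFirst = timeMs(aos);
  t.aosWarm = timeMs(aos);
  t.soaFirst = timeMs(soa);
  t.soaWarm = timeMs(soa);
  check(cudaFree(cells));
  check(cudaFree(a));
  check(cudaFree(b));
  check(cudaFree(out));
  return t;
}
```

For example, calling `compareLayouts(100000)` and comparing `aosWarm` with `soaWarm` mostly isolates the kernel-side cost of the strided loads, while the gap between the `First` numbers also includes migrating the unused fields. Results will vary by GPU; in particular, a large L2 cache may hold much of the ~20 MB `Cell` array between runs and partly hide the AoS penalty in the warm timing.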
Thanks!