In CUDA C Programming Guide, there is a part that says:
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).
If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.
I am using a Java wrapper to use CUDA in my code (JCuda). I have defined my own float3
equivalent in Java (which is just a float[]
array of interleaved x, y and z elements).
My question is, since the float3
that I have defined occupies 3 x sizeof(float) = 12 bytes
and 12 bytes is not equal to the length of a word that CUDA fetches, should I manually add a padding element at the end and make it 16 bytes?
As a side question which is very related:
My kernel requires a pointer to float3
data, thus when I call it from Java, I pass it the float[] data that I have which contains all float3
elements in the Java side. Right now that my java float3
's are not aligned, am I processing wrong values? I'm asking because in another part of the programming guide it says:
Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc()or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.
So does this mean that when my data are not aligned and I request a certain offset in that data, I am fetching wrong values?
Thanks in advance for the answers :-)