2

In the CUDA C Programming Guide, there is a part that says:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.

I am using a Java wrapper to use CUDA in my code (JCuda). I have defined my own float3 equivalent in Java (which is just a float[] array of interleaved x, y and z elements).
My question is: since the float3 I have defined occupies 3 x sizeof(float) = 12 bytes, and 12 bytes is not one of the word sizes that CUDA can fetch with a single instruction, should I manually add a padding element at the end to make it 16 bytes?

As a side question which is very related:
My kernel requires a pointer to float3 data, so when I call it from Java, I pass it the float[] array that contains all the float3 elements on the Java side. Given that my Java float3s are not aligned right now, am I processing wrong values? I'm asking because another part of the programming guide says:

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.

So does this mean that when my data are not aligned and I request a certain offset in that data, I am fetching wrong values?

Thanks in advance for the answers :-)

Maghoumi
  • 1
    About the first part, it depends on your device's compute capability and the existence of a cache on your device. Have a look at [this](http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/) post that examines different scenarios using different devices. – Farzad Feb 12 '14 at 16:48
  • About the second part, I personally don't understand how your elements are not aligned. Aren't they placed in memory in the form of {a0.x,a0.y,a0.z,a1.x,a1.y,a1.z, ...}? – Farzad Feb 12 '14 at 16:51
  • Thanks @Farzad! However, the CUDA C Programming Guide specifically says that an access to an address that is not a multiple of the word size will not be fully coalesced. This is my main concern here, as it seems that reading my input data is not fully coalesced. – Maghoumi Feb 12 '14 at 19:18
  • Regarding your second comment, they are placed exactly in the way you mentioned; however, it seems that they currently do not meet the word alignment requirement that the documentation specifies. I just find the wording of the quoted paragraph a bit confusing, which is why I'm asking about it here again... – Maghoumi Feb 12 '14 at 19:20
  • 1
    Correct data access requires data to be naturally aligned. If your float3 is simply a struct of floats, the struct needs to be aligned to a four-byte boundary, i.e. sizeof(float). For performance reasons, you may want to use a built-in aligned vector type like float4 instead of your custom float3 type; this can improve memory throughput, but it will increase the alignment requirement to 16 bytes, i.e. sizeof(float4). – njuffa Feb 12 '14 at 22:04
  • @njuffa if you present your comments in an answer, I would upvote it. – Robert Crovella Feb 16 '14 at 21:19

2 Answers

4

There are two aspects to this question:

  1. What are the requirements for correct memory access?
  2. How can one optimize the throughput of memory accesses?

To the first item: As the CUDA documentation points out, in order to load and store data correctly, the address of each access must be evenly divisible by the size of the access. For example, an object of type float has a size of four bytes, so it must be accessed at an address that is a multiple of four. If the alignment requirement is violated, data will be read and stored incorrectly, that is, the data becomes garbled.

For built-in non-compound types, the required alignment is equal to the size of the type; this is called "natural alignment". For user-defined compound types, such as structs, the required alignment is the alignment of the largest component type. This applies to the user-defined float3 type in the question, which has a four-byte alignment requirement because its largest component is of type float. Programmers can increase the required alignment with the __align__() attribute. See: How to specify alignment for global device variables in CUDA

For built-in compound types, CUDA requires alignment that is equal to the size of the compound type. For example, objects of types int2 and float2 must be aligned on an 8-byte boundary, while objects of types float4 and double2 must be aligned to a 16-byte boundary.

To the second item: The GPU is able to perform aligned 4-byte, 8-byte, and 16-byte accesses, and in general, the wider each access the higher the overall memory throughput. A vastly simplified view of the GPU hardware is that there are fixed-sized queues inside the hardware that track each memory access. The wider each memory access, the larger the total amount of bytes that can be queued up for transfer, which in turn improves latency tolerance and overall memory throughput.

For this reason I would suggest switching, if possible, from a custom float3 type to the built-in float4 type. The former will cause data to be loaded in chunks of four bytes, while the latter allows data to be loaded in chunks of 16 bytes.

njuffa
1

So after some trial and error, it seems that using padded float3's definitely improves the performance of the program. Thus I decided to use both padded float3's and strided memory (using cudaMallocPitch).

However, I still have not heard a good answer for the second part of my question.

Maghoumi
  • 1
    Actually I believe the last comment by @njuffa did answer your "second part", if by that you mean "As a side question ...". Your own `float3` structure, as an array of `float`, only needs to be aligned on a `float` boundary for proper access. Note that your `float3` structure is neither an 8-byte nor a 16-byte word, so the entire excerpt you have shown is irrelevant. – Robert Crovella Feb 16 '14 at 22:04
  • @RobertCrovella I think I am a bit confused. Is the 2nd excerpt talking about the exception that can arise if you try accessing unaligned data? (for example accessing a float in the middle of the memory where the starting address is not the multiple of the size of the float)? – Maghoumi Feb 16 '14 at 22:45
  • It's referring to the specific case of reading an 8 or 16 byte word. A `float` is a 4-byte word, and your `float3` struct is 12 bytes. The excerpt is not applicable to either `float` or `float3`. A `float2` could be read as an 8-byte word. In this case, to get a proper 8-byte read, it must be read on a `float2` boundary (ie. a naturally aligned boundary). – Robert Crovella Feb 16 '14 at 23:05
  • @RobertCrovella So now that I have a `float4` which is 16 bytes, is it applicable to me? Am I fetching wrong values? – Maghoumi Feb 16 '14 at 23:11
  • 1
    You should make sure your `float4` reads are aligned on 16-byte (i.e. `float4`, naturally aligned) boundaries. Without actually seeing your code and following it all the way through the toolchain, yes, it applies to you. – Robert Crovella Feb 16 '14 at 23:15
  • @RobertCrovella Thank you for your elaboration. As I mentioned earlier, I am passing a `float[]` array (of interleaved `float` elements) to cuda and I have defined the beginning address of the array as a `float4` pointer. Any other read that I am doing is fetching `float4` values from this array as if this array was a `float4` array in the first place. Any tips on how I can actually see if there is a problem with my code? – Maghoumi Feb 16 '14 at 23:23
  • 1
    As long as you are fetching `float4` values at a valid `float4` offset from the beginning of the `float` array, there should be no issue, assuming the array starting point is at a boundary provided by `cudaMalloc`. If you want, as a test, before dereferencing any `float` index into said `float[]` array, you can check to see if the lowest 2 bits of the index are zero. If they are, it's a valid `float4` index into that array. – Robert Crovella Feb 16 '14 at 23:49
  • @RobertCrovella OK that makes sense. I can verify that they are multiples of 4 therefore the case discussed in the excerpt is not applicable to me. Thanks again for the clarification – Maghoumi Feb 17 '14 at 01:37