cuda memory alignment

Question

In my code I am using structures to make it easier to pass arguments to functions (I don't use arrays of structures; in general I use structures of arrays instead). When I am in cuda-gdb and examine the point in a kernel where I assign values to a simple structure like

struct pt{
int i;
int j;
int k;
};

even though I am not doing anything complicated and it's obvious that the members should hold the values I assigned, I get...

Asked for position 0 of stack, stack only has 0 elements on it.

So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to

struct __align__(16) pt{
int i;
int j;
int k;
};

but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following errors:

error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token

So, am I supposed to have two different definitions for host and device structures?

Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.

For example, how should the following two be aligned? Or, how should a structure with 6 floats be aligned? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within the kernels or __device__ functions.

struct {
    int a;
    int b;
    int c;
    int d;
    float* el;    
} ;

struct {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
} ;

Thank you in advance for any advice or hints

I think you are looking for this: http://stackoverflow.com/questions/6978643/cuda-and-classes, answered by @harrism himself. — eLRuLL, Oct 08 '12 at 10:09

score 32 · Accepted Answer · answered Oct 08 '12 at 10:21

There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.

First, your host compiler gives you errors because it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.

#if defined(__CUDACC__) // NVCC
   #define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
  #define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
  #define MY_ALIGN(n) __declspec(align(n))
#else
  #error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

So, am I supposed to have two different definitions for host and device structures?

No, just use MY_ALIGN(n), like this:

struct MY_ALIGN(16) pt { int i, j, k; };
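As a minimal sketch of how this might look in a header shared by host and device code, the snippet below repeats the macro from above, declares pt once, and checks the resulting layout at compile time. The static_assert checks are an illustration added here, not part of the original answer, and they assume a 4-byte int.

```cpp
// Portable alignment macro, as defined in the answer above.
#if defined(__CUDACC__) // NVCC
  #define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
  #define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
  #define MY_ALIGN(n) __declspec(align(n))
#else
  #error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

// One definition, usable from both host and device translation units.
struct MY_ALIGN(16) pt { int i, j, k; };

// Assumption: int is 4 bytes, so the 12 bytes of members are padded up to 16.
static_assert(alignof(pt) == 16, "pt should start on a 16-byte boundary");
static_assert(sizeof(pt) == 16, "pt should be padded to a multiple of 16");
```

If either check fails on your platform, the compiler stops with the message given, which is usually easier to diagnose than a layout mismatch between host and device at run time.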

For example, how should the following two be aligned?

First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires each thread to read words of 1, 2, 4, 8 or 16 bytes that are aligned to their size. So...

struct MY_ALIGN(16) {
  int a;
  int b;
  int c;
  int d;
  float* el;    
};

In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment will waste 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine, it will waste only 8 bytes per struct, due to the 8-byte pointer. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff will be that the hardware will have to use three 8-byte loads instead of two 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
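The padding arithmetic for this struct can be checked with a small compile-time helper. This is only a sketch: padded_size and wasted_bytes are hypothetical names introduced here, and the struct sizes (20 bytes with 4-byte pointers, 24 bytes with 8-byte pointers) assume a 4-byte int.

```cpp
#include <cstddef>

// Size of one array element once it is rounded up to a multiple of `align`.
constexpr std::size_t padded_size(std::size_t size, std::size_t align) {
    return (size + align - 1) / align * align;
}

// Bytes of padding per struct: the padded size minus the bytes actually used.
constexpr std::size_t wasted_bytes(std::size_t size, std::size_t align) {
    return padded_size(size, align) - size;
}

// First struct: four 4-byte ints plus one pointer.
static_assert(wasted_bytes(20, 16) == 12, "32-bit pointers, 16-byte alignment");
static_assert(wasted_bytes(24, 16) == 8,  "64-bit pointers, 16-byte alignment");
static_assert(wasted_bytes(20, 8)  == 4,  "32-bit pointers, 8-byte alignment");
static_assert(wasted_bytes(24, 8)  == 0,  "64-bit pointers, 8-byte alignment");
```

The last two checks show why 8-byte alignment is attractive here: it leaves 4 bytes of padding with 32-bit pointers and none at all with 64-bit pointers.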

struct MY_ALIGN(16) {
  int a;
  int b;
  int c;
  int d;
  float* i;
  float* j;
  float* k;
};

In this case, with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines, or 8 on 64-bit machines. Loading the struct would require two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads. Again, tradeoffs.
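To put rough numbers on "excessive loads", here is a sketch of a load-count estimate. It rests on a simplifying assumption that is not stated in the answer: that the compiler issues loads exactly as wide as the declared alignment, with 16 bytes as the widest load. loads_per_struct is a hypothetical helper, and the struct sizes (28 and 40 bytes) assume a 4-byte int.

```cpp
#include <cstddef>

// Hypothetical estimate: loads needed to read one struct, assuming each load
// is exactly `align` bytes wide (and `align` is at most 16).
constexpr std::size_t loads_per_struct(std::size_t size, std::size_t align) {
    return (size + align - 1) / align;
}

// Second struct: four 4-byte ints plus three pointers.
static_assert(loads_per_struct(28, 16) == 2, "32-bit pointers, 16-byte alignment");
static_assert(loads_per_struct(40, 16) == 3, "64-bit pointers, 16-byte alignment");
static_assert(loads_per_struct(28, 4)  == 7, "32-bit pointers, 4-byte alignment");
static_assert(loads_per_struct(40, 8)  == 5, "64-bit pointers, 8-byte alignment");
```

Under that model, dropping the padding costs seven loads instead of two with 32-bit pointers, and five instead of three with 64-bit pointers.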

or, how should a structure with 6 floats be aligned?

Again, tradeoffs: with 16-byte alignment you waste 8 bytes per struct and need two 16-byte loads; with 8-byte alignment you waste nothing but need three 8-byte loads.

or 4 integers?

No tradeoff here. MY_ALIGN(16).

again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or __device__ functions.

Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is worth worrying about, and it's another good reason to favor structures of arrays over arrays of structures.

Thank you very much for your answer. I was hoping at most for a link to an external reference, and this is much more, a full lesson on alignment. I am flattered. My code does indeed use structures of arrays. I use structures like pt (mentioned above) on a smaller scale, to make it easier to pass arguments from within kernels to the __device__ functions they call. And that is where they seem to be invisible when I try to query their values from cuda-gdb. — Panagiotis, Oct 08 '12 at 10:50
Glad to help. An upvote wouldn't go amiss. :) Not sure if this will help the cuda-gdb issue. In my experience the device code debugger doesn't always show all values--only the ones that are immediately in scope / active at the current paused code position. — harrism, Oct 08 '12 at 11:08
So, just to clarify: when I receive "Asked for position 0 of stack, stack only has 0 elements on it." from gdb... does it mean that gdb doesn't make the value available for me to query(?), or that the variable is not defined yet and/or has no value assigned to it? I'm more concerned with what happens in the program execution itself than with what I can see through gdb, of course. — Panagiotis, Oct 09 '12 at 08:34
I'm not familiar with that error. Did you compile your .cu file(s) with -G and -g in order to debug with cuda-gdb? — harrism, Oct 09 '12 at 10:27
Yes I did. I also used -O0 which is supposed to be implied in -G, but in several cases I have seen that this is not the case. — Panagiotis, Oct 09 '12 at 18:19
I think this error should be a separate question. Perhaps create an as-simple-as-possible repro code, and post it with a question about the error. cuda and cuda-gdb tags... — harrism, Oct 09 '12 at 22:39
Years later, still an excellent answer! Worth mentioning that Intel's compiler also uses `__declspec(align(n))`. The compiler defines `__INTEL_COMPILER`, so you could add it to your `MSVC` one. And Clang defines `__clang__`, and uses GCC's version (`__attribute__((aligned(n)))`), so you could add it there. That pretty much covers all major (non-specialized, i.e. ARM, which I know nothing about) compilers I have to deal with ;) I've had trouble with Intel because they define a bunch of things you wouldn't expect. I don't remember the specifics, but the solution is to check for it first. — svenevs, Jan 05 '16 at 06:46

einpoklum · Answer 2 · 2022-03-23T13:38:23.520

10

These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and, if I'm not mistaken, by nvcc as well. That should save you the need to resort to macros.
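A minimal sketch of what the alignas version might look like for the structs discussed in this thread. The names four_ints and six_floats are hypothetical, the size checks assume 4-byte int and float, and whether a given nvcc release accepts alignas in device code is worth confirming against its documentation.

```cpp
// Standard C++11 spelling; no compiler-specific macro is needed.
struct alignas(16) pt { int i, j, k; };

// Hypothetical names for the "4 integers" and "6 floats" cases.
struct alignas(16) four_ints  { int a, b, c, d; };
struct alignas(8)  six_floats { float v[6]; };

// Assumption: int and float are 4 bytes each.
static_assert(alignof(pt) == 16 && sizeof(pt) == 16, "12 bytes padded to 16");
static_assert(sizeof(four_ints) == 16, "exact fit, no padding");
static_assert(sizeof(six_floats) == 24, "no padding at 8-byte alignment");
```

Because alignas is part of the language, the same header compiles unchanged with any C++11 host compiler.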

edited Mar 23 '22 at 13:38

answered Mar 21 '16 at 18:50

