7

Can anyone tell me whats wrong with the following code inside a CUDA kernel:

__constant__ unsigned char MT[256] = {
    0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....};

typedef unsinged int U32;

__global__ void Kernel (unsigned int  *PT, unsigned int  *CT, unsigned int  *rk)
{

    long int i;
    __shared__ unsigned char sh_MT[256];    

    for (i = 0; i < 64; i += 4)
        ((U32*)sh_MT)[threadIdx.x + i] = ((U32*)MT)[threadIdx.x + i];

    __shared__ unsigned int sh_rkey[4];
    __shared__ unsigned int sh_state_pl[4];
    __shared__ unsigned int sh_state_ct[4];

    sh_state_pl[threadIdx.x] = PT[threadIdx.x];
    sh_rkey[threadIdx.x] = rk[threadIdx.x];
    __syncthreads();


    sh_state_ct[threadIdx.x] = ((U32*)sh_MT)[sh_state_pl[threadIdx.x]]^\
    ((U32*)(sh_MT+3))[((sh_state_pl[(1 + threadIdx.x) % 4] >> 8) & 0xff)] ^ \
    ((U32*)(sh_MT+2))[((sh_state_pl[(2 + threadIdx.x) % 4] >> 16) & 0xff)] ^\
    ((U32*)(sh_MT+1))[((sh_state_pl[(3 + threadIdx.x) % 4] >> 24) & 0xff )];


    CT[threadIdx.x] = sh_state_ct[threadIdx.x];
}

At This line of code ,

((U32*)(sh_MT+3))......

The CUDA debugger gives me the error message : misaligned address

How can I fix this error?

I am using CUDA 7 in MVSC and i use 1 Block and 4 threads for executing the Kernel Function as follow:

__device__ unsigned int *state;
__device__ unsigned int *key;
__device__ unsigned int *ct;
.
.
main()
{
cudaMalloc((void**)&state, 16);
cudaMalloc((void**)&ct, 16);
cudaMalloc((void**)&key, 16);
//cudamemcpy(copy some values to => state , ct, key);   
Kernel << <1, 4 >> >(state, ct, key); 
}

Remember please, I can't change my "MT Table" type. Thanks in advance for any advice or answer .

Rezaeimh7
  • 1,467
  • 2
  • 23
  • 40
  • 3
    As the error message tells you, the pointer is not aligned to the boundary required by the processor. Basically, you can't dereference a 32-bit pointer from an address not aligned at a 32-bit boundary. What it means: you can do `(U32*)(sh_MT)` and `(U32*)(sh_MT+4)` but not `(U32*)(sh_MT+3)` or such. You probably have to read the bytes separately and join them together. – CherryDT May 19 '16 at 12:27
  • @CherryDT if you want to provide an answer I would upvote. – Robert Crovella May 19 '16 at 13:24
  • OK I finally found the right source to quote so it's now worthy being an answer. Added. – CherryDT May 19 '16 at 13:37

1 Answers1

15

What the error message means is that the pointer is not aligned to the boundary required by the processor.

From the CUDA Programming Guide, section 5.3.2:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

This is what the debugger is trying to tell you: Basically, you shouldn't dereference a pointer pointing to a 32-bit value from an address not aligned at a 32-bit boundary.

You can do (U32*)(sh_MT) and (U32*)(sh_MT+4) just fine, but not (U32*)(sh_MT+3) or such.

You probably have to read the bytes separately and join them together.

CherryDT
  • 25,571
  • 5
  • 49
  • 74
  • Um.... `sh_MT` points to shared memory, not global, but similar rules may still apply. Also note, that since you are creating a `char[256]` in shmem, its starting address does not have to be aligned to anything more than 1 byte. As a result, `sh_MT+3` may or may not be actually aligned to 4 bytes and it is entirely compiler/hardware specific. – CygnusX1 May 20 '16 at 06:03
  • Hm you are right about global vs shared... But about the size : the OP converts it to a `U32*` before dereferencing – CherryDT May 20 '16 at 06:07
  • Let me rephrase: `sm_MT` is of type `unsigned char[256]`. As such it may be given an address `0x4` or `0x6` or `0x7`. Anything is possible, because it has no alignment requirements. If, say, `0x7` is its address, then `sh_MT+1` will work fine for dereferencing an `U32` type, but `sh_MMT+4` will not. I may add - such values are not uncommon! On GTX 200-series, first 6 bytes were used to store blockIdx (as `unsigned short`) and 7-th byte (address `0x6`) was assigned to first shared variable, as long as alignment permitted. Newer devices may or may not organize differently – CygnusX1 May 20 '16 at 14:43
  • Yes, right, I assumed it was aligned. OK then one first has to figure out the right alignment. – CherryDT May 20 '16 at 17:48