1

The reason why I ask this is because there is some strange bug in my code and I suspect it could be some aliasing problem:

__shared__ float x[32];
__shared__ unsigned int xsum[32];

int idx=threadIdx.x;
unsigned char * xchar=(unsigned char *)x;
//...do something
 if (threadIdx.x<32)
 {
    xchar[4*idx]&=somestring[0];
    xchar[4*idx+1]&=somestring[1];
    xchar[4*idx+2]&=somestring[2];
    xchar[4*idx+3]&=somestring[3];

    xsum[idx]+=*((unsigned int *)(x+idx));//<-Looks like the compiler sometimes fail to recongize this as the aliasing of xchar;
 };
user0002128
  • 2,785
  • 2
  • 23
  • 40
  • Why aren't you putting `__shared__` in front of `unsigned char * xchar`? Same thing for `(unsigned int *)x`. – thejh Mar 28 '13 at 23:32
  • @thejh Do I need to do that here? – user0002128 Mar 28 '13 at 23:35
  • Not sure, but I think so. You could try whether it maybe makes your code work... – thejh Mar 28 '13 at 23:38
  • @thejh I think that putting `__shared__` doesn't change anything as the memory has already been allocated in shared memory. xchar is just a pointer on this memory. I think there is a race condition here. Every thread is writing on the same memory. So when do execute `xsum+=...` you don't know how many threads have done the previous operations. – Seltymar Mar 29 '13 at 00:42
  • what happens if you put the `volatile` keyword in front of your definitions for `x[]` and `*xchar` ? And if your kernel actually ends at the `xsum[id] +=...` line, then you have other issues as well. I'm assuming there is other code after that. – Robert Crovella Mar 29 '13 at 03:33
  • @RobertCrovella What other issues? Its just a code example to demonstrate the aliasing problem, it has nothing to do with the real codes, however in this example, I cannot see why the kernel cannot end with the xsum+= line other than being pointless at there. – user0002128 Mar 29 '13 at 09:44
  • @RobertCrovella And why need volatile here? the code is within warp-scale, and basically data used are tied at each lane, the modified data should be visable to its own lane/thread, unless the compiler treat the two pointers as being independant and creat some ILP that could cause a race there. – user0002128 Mar 29 '13 at 10:07
  • If all your code did was update shared memory, the compiler could optimize things out. And without the volatile keyword, the compiler is free to optimize any shared memory access into a register, which might have some ramifications for pointer aliasing. If you had posted real, sensible code, I would try it myself. – Robert Crovella Mar 29 '13 at 13:38

2 Answers2

2

The compiler only needs to honour aliasing between compatible types. Since char and float are not compatible, the compiler is free to assume the pointers never alias.

If you want to do bitwise operations on float, firstly convert (via __float_as_int()) to unsigned integer, then operate on that, and finally convert back to float (using __int_as_float()).

talonmies
  • 70,661
  • 34
  • 192
  • 269
tera
  • 7,080
  • 1
  • 21
  • 32
  • There is something I don't understand. In the programming guide p.85, they declare `array0` with a different type pointing on the memory of `array`. There is aliasing here, right ? If I use `array` after modifying `array0` I should get the modified value ? – Seltymar Mar 29 '13 at 08:11
  • 1
    That would be undefined. In the [example](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared) it is implicitly assumed that `array` is never accessed. It only serves to allocate the memory for backing `array0`, `array1`, and `array2`. – tera Mar 29 '13 at 12:26
1

I think you have a race condition here. But I don't know what is somestring. If it is the same for all threads you can do like this:

__shared__ float x[32];

unsigned char * xchar=(unsigned char *)x;

//...do something

if(threadIdx.x<4) {
     xchar[threadIdx.x]&=somestring[threadIdx.x];
}

__syncthreads();

unsigned int xsum+=*((unsigned int *)x);

It means that every thread shares the same array and therefore, xsum is the same between all threads. If you want that each thread has its own array, you have to allocate an array of 32*number_of_threads_in_block and use an offset.

PS: the code above works only in 1D block. In 2D or 3D you have to compute you own threadID and be sure that only 4 threads execute the code.

Seltymar
  • 337
  • 6
  • 21
  • My apologize, the example code was misleading, I were just trying to save a few lines of codes in this example, in my original code, there is no race condition. – user0002128 Mar 29 '13 at 01:42
  • @user0002128 ok, so I guess you allocate memory at runtime, otherwise it would only work for block of 32 threads. – Seltymar Mar 29 '13 at 01:47