False sharing in CUDA GPUs: does it exist, and is it similar to CPUs?

Question

I understand that in symmetric multiprocessor (SMP) systems, false sharing may occur because each core has its own cache, as in the following code (from http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads):

    /* Adjacent sum_local[] elements are likely to share a cache line. */
    double sum = 0.0, sum_local[NUM_THREADS];
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me] = 0.0;

        /* Each thread repeatedly writes its own element of the shared array. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            sum_local[me] += x[i] * y[i];

        #pragma omp atomic
        sum += sum_local[me];
    }

So my questions are:

1. False sharing mainly comes from the fact that memory is accessed in chunks of a fixed number of bytes, and that each processor has its own cache. Each chunk is read and written as a unit. Is my understanding close to correct?
2. What about the memory access pattern on GPUs? Do all the stream processors share one cache, or does each have its own? Is false sharing also a concern in GPU computing?

score 3 · Accepted Answer · edited Dec 19 '13 at 22:50


1. It's not that the cache line is read or written together; it's that when any CPU writes any byte of the line, the whole cache line is invalidated in the other CPUs' caches.
2. It depends on the GPU. On at least some NVIDIA GPUs, the L1 cache (for global memory) is NOT coherent, so you have other problems. With L1 disabled, you can have the problem in the L2 cache, which is coherent.
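
To make point 1 concrete on the CPU side, here is a minimal sketch of one common workaround for the question's snippet: pad each per-thread accumulator so that it occupies its own cache line, so one thread's writes no longer invalidate the line holding another thread's accumulator. The 64-byte line size, the `padded_sum_t` type, and the `dot_padded` function are assumptions made for illustration; they are not from the Intel article, and the real line size should be checked for the target CPU.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without OpenMP enabled. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define NUM_THREADS 4
#define CACHE_LINE 64   /* ASSUMED cache line size in bytes */

/* Hypothetical helper type: one accumulator padded to a full cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum_t;

static_assert(sizeof(padded_sum_t) == CACHE_LINE,
              "each accumulator must fill exactly one assumed cache line");

/* Dot product of x and y (n elements) with padded per-thread partial sums. */
double dot_padded(const double *x, const double *y, int n)
{
    /* Align the array so element k starts on a cache line boundary. */
    _Alignas(CACHE_LINE) padded_sum_t sum_local[NUM_THREADS];
    double sum = 0.0;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me].value = 0.0;

        /* Each thread now writes only to a line that no other thread touches. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            sum_local[me].value += x[i] * y[i];

        /* Combine the partial sums one thread at a time. */
        #pragma omp atomic
        sum += sum_local[me].value;
    }
    return sum;
}
```

Calling `dot_padded(x, y, N)` computes the same sum as the original loop. In practice a thread-private scalar, or OpenMP's `reduction(+:sum)` clause, avoids the shared array altogether and is usually simpler than manual padding.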

edited Dec 19 '13 at 22:50

einpoklum


answered Dec 15 '13 at 19:58

stark

