Cuda - selective memory store

Question

In my kernel, if a condition is met, I update an item of the output buffer

if (condition(input[i])) //?
    output[i] = 1;

otherwise the output may stay the same, having value of 0.

The density of updates are quite unpredictable, depending on the input. Furthermore which output location will be updated is also not known. (i may force them though, in some cases)

My question is, is it better to write all items, to achieve coalescing, or do a selective write?

output[i] = condition(input[i]); //?

Would you mind discussing your statements?

Eugene · Answer 1 · 2012-08-23T15:44:00.637

1

On my setup kernel that does conditional set (option 1) runs for 1.727 us and option 2 1.399 us. This is my code (setConditional is the faster one):

__global__ void conditionalSet(unsigned int* array) {
    if ((threadIdx.x & 3) == 0) {
        array[threadIdx.x] = 1;
    }
}

__global__ void setConditional(unsigned int* array) {
    array[threadIdx.x] = (threadIdx.x & 3) == 0 ? 1 : 0;
}

edited Aug 23 '12 at 15:44

answered Aug 22 '12 at 22:32

Eugene

9,242
2
30
29

1

"ms" stands for "millisecond" and "us" stands for "microsecond". So what is "mus"? – harrism Aug 23 '12 at 01:37
Your timings cannot be right -- kernel launch overhead is much more than a couple of microseconds. How are you timing? – harrism Aug 24 '12 at 02:18
I've got those numbers from visual profiler (when clicking the kernel in the timeline - they exclude the overhead). But, as you pointed out, they are totally bogus as the difference in the run time is likely to be attributed to difference in instruction count, not the memory access pattern. Once I fix my local env I will experiment more. – Eugene Aug 24 '12 at 15:45

score 1 · Accepted Answer · answered Aug 23 '12 at 01:36

1

Coalescing is achieved even if some threads in the warp do not participate in the load or store, as long as all participating threads satisfy the requirements of coalescing. So conditional writes should have no effect on memory throughput.

However, doing a conditional write may involve additional instructions due to involving a branch (this would probably explain, for example, the difference in performance measured by Eugene in his answer).

answered Aug 23 '12 at 01:36

harrism

26,505
2
57
88

What if the rate of writing the output buffer is quite low. – phoad Aug 23 '12 at 10:32
1

Do you mean, "what if the fraction of threads per warp that do the store is low"? Because memory transactions are performed per warp, assuming at least one thread per warp does a store, then my answer is the same. If at least one thread per warp performs the store, then the cost (assuming perfect coalescing) is the same as if all threads had done it. But if a significant fraction of warps do no stores, then the `if()` approach is definitely superior. – harrism Aug 24 '12 at 02:21
Thank you @Harrism. So, is there no benefit of storing the results on the shared memory and storing them back to global memory in one shot? It may enable changing the write types to float4 from float for 4 values though. Is just a synchronization enough before global memory writes? Can you point a good approach for global memory stores, ways of limiting the writes, or not limiting them? (I may convert this to a question too :) – phoad Aug 26 '12 at 20:06
I think you should do some experiments. The answer to your question really depends on the particulars of your computation. – harrism Aug 26 '12 at 23:39
Yes, it will be better to do some experiments, and I may share the results for further discussion. kind regards. – phoad Aug 27 '12 at 10:55

Cuda - selective memory store

2 Answers2