I am using CUB's InclusiveScan, which takes a custom binary, non-commutative operator. My operator is defined as follows:
    template <typename T>
    struct MultAddFunctor
    {
        const T factor;
        MultAddFunctor(T factor) : factor(factor) {}

        // Scan operator: op(a, b) = factor*a + b
        __device__ __forceinline__
        T operator()(const T &a, const T &b) const {
            return factor*a + b;
        }
    };
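For context, here is roughly how I invoke the scan; this is a sketch following the device-wide DeviceScan documentation example, with the buffer names, the float type, and the item count as placeholders for this question:

    // Sketch of the call site, following the cub::DeviceScan::InclusiveScan
    // documentation example. d_in/d_out are device arrays of num_items floats,
    // and MultAddFunctor is the operator defined above.
    #include <cub/cub.cuh>

    void run_scan(float *d_in, float *d_out, int num_items)
    {
        MultAddFunctor<float> scan_op(0.8f);

        // First call only computes the required temporary storage size.
        void   *d_temp_storage     = nullptr;
        size_t  temp_storage_bytes = 0;
        cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                       d_in, d_out, scan_op, num_items);

        // Allocate the temporary storage and run the actual scan.
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                       d_in, d_out, scan_op, num_items);

        cudaDeviceSynchronize();
        cudaFree(d_temp_storage);
    }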
Beyond that, my code is nearly identical to the example code in the documentation (except that I also free the allocated memory and have added extra synchronization calls to rule those out as the problem). When factor is 1.0, this produces the correct results (a plain prefix sum). When factor is something else (such as 0.8), the results are correct for the first 12 values but diverge considerably after that. For example, if the array being scanned is just a bunch of 1.0s, I get the following results:
     i   CUDA    Serial
     0   1.000   1.000   ✅
     1   1.800   1.800   ✅
     2   2.440   2.440   ✅
     3   2.952   2.952   ✅
     4   3.362   3.362   ✅
     5   3.689   3.689   ✅
     6   3.951   3.951   ✅
     7   4.161   4.161   ✅
     8   4.329   4.329   ✅
     9   4.463   4.463   ✅
    10   4.571   4.571   ✅
    11   4.656   4.656   ✅
    12   6.313   4.725   ❌
    13   6.050   4.780   ❌
    14   5.840   4.824   ❌
    15   5.672   4.859   ❌
...
At element 12 there is a sudden jump in the values and then a decrease, even though the sequence should keep increasing smoothly (with factor = 0.8 and an all-ones input it should approach 1/(1 − 0.8) = 5).
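For reference, the Serial column above comes from the plain sequential recurrence the scan should reproduce; this is a host-side sketch of how I compute it, with the factor and the all-ones input hard-coded for this test:

    // Host-side reference: out[i] = factor*out[i-1] + in[i],
    // i.e. a left-fold of op(a, b) = factor*a + b over the input.
    #include <vector>

    std::vector<float> serial_scan(const std::vector<float> &in, float factor)
    {
        std::vector<float> out(in.size());
        // acc starts at 0, which acts as an identity here: factor*0 + b = b.
        float acc = 0.0f;
        for (size_t i = 0; i < in.size(); ++i) {
            acc = factor * acc + in[i];   // e.g. factor = 0.8f, in[i] = 1.0f
            out[i] = acc;
        }
        return out;
    }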
At first I thought it was due to the non-commutativity of the operation, but the docs explicitly state that non-commutative operators are fine. I also suspected that the factor field was not reaching the device correctly, but even if I hard-code 0.8 into the equation (variant sketched below) the results are still wrong (although factor probably ends up in global memory, so moving it into shared/local memory would be worthwhile later).
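This is the kind of hard-coded variant I tested to rule out a bad copy of factor to the device; it produces the same wrong results:

    // Variant with the factor baked into the operator, so no member needs to
    // be copied to the device. The scan output is unchanged (still wrong).
    struct MultAddHardcoded
    {
        __device__ __forceinline__
        float operator()(const float &a, const float &b) const {
            return 0.8f*a + b;
        }
    };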
What other reason could there be for the scan computing incorrect results?