1

I have a set of operations running in a loop.

for(int i = 0; i < row; i++)
{
    sum += arr1[0] - arr2[0];
    sum += arr1[1] - arr2[1];
    sum += arr1[2] - arr2[2];
    sum += arr1[3] - arr2[3];

    arr1 += offset1;
    arr2 += offset2;
}

Now I'm trying to vectorize the operations like this:

for(int i = 0; i < row; i++)
{
    convert_int4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

But how do I accumulate the resulting vector in the scalar sum without using a loop?

I'm using OpenCL 2.0.

Harsh Wardhan

3 Answers

1

This operation is called a "reduction", and there seems to be some information on it here.

OpenCL 2.0 provides built-in work-group functions for this, such as work_group_reduce_add(), which might help you: link.

And a presentation including some code: link.

JHBonarius
  • Reductions, as far as I can understand, are for work items in a work group, whereas my code runs in a single work item. – Harsh Wardhan Feb 10 '17 at 10:36
  • "Reduction" is a general concept, where multiple variables are 'reduced' to one. You can use different operations for that: add, multiply, min, max, XOR, AND, OR, etc. The links I sent you show some code on how to write an efficient parallel implementation. As every situation is different, I am not sure there is a simple operation that solves your problem. – JHBonarius Feb 10 '17 at 11:44
1

For float2, float4, and similar types, the easiest version could be a dot product (though conversion from int to float could be expensive).

float4 v1=(float4 )(1,2,3,4);
float4 v2=(float4 )(5,6,7,8);

float sum=dot(v1-v2,(float4)(1,1,1,1));

this is equal to

(v1.x-v2.x)*1 + (v1.y-v2.y)*1+(v1.z-v2.z)*1+(v1.w-v2.w)*1 

and, if there is any hardware support for it, leaving it to the compiler's mercy should be okay. For larger vectors, and especially arrays, J.H.Bonarius's answer is the way to go. As far as I know, only CPUs have such horizontal sum operations and GPUs do not, but for the sake of portability, the dot product and the work_group_reduce_* functions are the easiest ways to achieve readability and even performance.

The dot product performs extra multiplications, so it may not always be the best choice.
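
To make the arithmetic concrete, here is a minimal plain C sketch (not OpenCL C) of the same idea: the sum of the lane differences computed as a dot product with a vector of ones. The helper names dot4 and sum_of_differences are hypothetical and only stand in for OpenCL's built-in dot() on a float4.

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's dot() on a 4-element float vector.
   The name dot4 is an assumption made for this sketch. */
float dot4(const float a[4], const float b[4])
{
    float acc = 0.0f;
    for (size_t i = 0; i < 4; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Sum of the four lane differences, computed as dot(v1 - v2, ones). */
float sum_of_differences(const float v1[4], const float v2[4])
{
    const float ones[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float diff[4];
    for (size_t i = 0; i < 4; i++)
        diff[i] = v1[i] - v2[i];
    return dot4(diff, ones);
}
```

With v1 = (1,2,3,4) and v2 = (5,6,7,8) as above, every lane difference is -4, so the result is -16.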

huseyin tugrul buyukisik
  • Why is this operation not supported for integers? – Harsh Wardhan Feb 11 '17 at 04:48
  • Maybe because of the software industry, such as game developers writing their own square root algorithms for a long time instead of asking hardware vendors for them. Also, you can convert your horizontal add to a vertical add, so you can add whole vectors, but at least 4 or 5 vectors are needed. – huseyin tugrul buyukisik Feb 11 '17 at 09:06
1

I have found a solution that comes closest to what I was looking for.

uint sum = 0;
uint4 S = (uint4)(0);  // the accumulator must start at zero

for(int i = 0; i < row; i++)
{
    S += convert_uint4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

S.s01 = S.s01 + S.s23;
sum = S.s0 + S.s1;

OpenCL C lets you address groups of vector components (such as .s01 and .s23), so the accumulator can be folded in half repeatedly with additions, as shown above. This works for vectors of up to 16 elements. Wider operations can be split into several smaller vectors. For example, to add the absolute values of the differences between two vectors of size 32, we can do the following:

uint sum = 0;
uint16 S0 = (uint16)(0), S1 = (uint16)(0);  // accumulators must start at zero

for(int i = 0; i < row; i++)
{
    S0 += convert_uint16(abs(vload16(0, arr1) - vload16(0, arr2)));
    S1 += convert_uint16(abs(vload16(1, arr1) - vload16(1, arr2)));

    arr1 += offset1;
    arr2 += offset2;
}

S0 = S0 + S1;
S0.s01234567 = S0.s01234567 + S0.s89abcdef;
S0.s0123 = S0.s0123 + S0.s4567;
S0.s01 = S0.s01 + S0.s23;
sum = S0.s0 + S0.s1;
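
As a rough illustration of the folding pattern above, here is a plain C sketch (not OpenCL C) that adds the upper half of an array of lanes onto the lower half until one value remains. The function name fold_sum and the lanes array are hypothetical. In OpenCL C, each pass of the inner loop corresponds to a single vector addition such as S0.s0123 + S0.s4567, so no loop is needed there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n lanes by repeatedly adding the upper half onto the lower half.
   n must be a power of two; the lanes array is modified in place.
   The name fold_sum is an assumption made for this sketch. */
uint32_t fold_sum(uint32_t *lanes, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t half = n / 2; half > 0; half /= 2) {
        /* One halving step: 16 -> 8 -> 4 -> 2 -> 1 lanes. */
        for (size_t i = 0; i < half; i++)
            lanes[i] += lanes[i + half];
    }
    return lanes[0];
}
```

For 16 lanes this takes four halving steps, matching the four additions after the loop in the kernel code above (the last one written there as the scalar S0.s0 + S0.s1).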
Harsh Wardhan