How to distribute data read using intel_sub_group_block_read across work items in a subgroup in column major order in OpenCL?

Question

My vectorized OpenCL code looks like this:

short8 x0, x1, x2, x3, x4, x5, x6, x7, m[8];

x0 = convert_short8(vload8(0, Org + 0 * Stride));
x1 = convert_short8(vload8(0, Org + 1 * Stride));
x2 = convert_short8(vload8(0, Org + 2 * Stride));
x3 = convert_short8(vload8(0, Org + 3 * Stride));
x4 = convert_short8(vload8(0, Org + 4 * Stride));
x5 = convert_short8(vload8(0, Org + 5 * Stride));
x6 = convert_short8(vload8(0, Org + 6 * Stride));
x7 = convert_short8(vload8(0, Org + 7 * Stride));

m[0] = x0 + x4;
m[1] = x1 + x5;
m[2] = x2 + x6;
m[3] = x3 + x7;
m[4] = x0 - x4;
m[5] = x1 - x5;
m[6] = x2 - x6;
m[7] = x3 - x7;
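
For checking any subgroup-based rewrite against the vectorized version, a plain C reference of the same arithmetic can be handy. The sketch below is a host-side model, not kernel code. It assumes `Org` points at 8-bit samples (the snippet does not show its type), and `butterfly_rows`, `org`, and `stride` are hypothetical names chosen to mirror `Org` and `Stride`.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference of the butterfly stage shown above.
 * org:    top-left sample of an 8x8 block (assumed to be 8-bit data)
 * stride: distance between consecutive rows, in samples
 * m[r][c] corresponds to component c of the short8 value m[r]. */
void butterfly_rows(const uint8_t *org, size_t stride, int16_t m[8][8])
{
    for (size_t r = 0; r < 4; ++r) {
        for (size_t c = 0; c < 8; ++c) {
            int16_t top = org[r * stride + c];          /* row r     */
            int16_t bottom = org[(r + 4) * stride + c]; /* row r + 4 */
            m[r][c] = (int16_t)(top + bottom);          /* x_r + x_(r+4) */
            m[r + 4][c] = (int16_t)(top - bottom);      /* x_r - x_(r+4) */
        }
    }
}
```

Each output row only combines row r with row r + 4 of the same column, which is the property any redistribution of the data across work items has to preserve.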

Now I'm trying to rewrite the vectorized logic above using the Intel OpenCL subgroup extension with block reads:

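int8 iO;
uint8 block1, block2;
// Byte coordinate of the block's top-left corner.
// OpenCL C vector literals use cast syntax: (int2)(0, 0), not int2(0, 0).
int2 coordA = (int2)(0, 0);

// Each call reads 8 rows; every work item receives one uint (4 bytes) per row.
block1 = intel_sub_group_block_read8(Org, coordA);
coordA.x += 4; // move 4 bytes to the right
block2 = intel_sub_group_block_read8(Org, coordA);

for (int i = 0; i < 8; i++)
{
    // Component i of each block, reinterpreted as 4 uchar values and widened to int.
    iO.lo = convert_int4(as_uchar4(((uint*)(&block1))[i]));
    iO.hi = convert_int4(as_uchar4(((uint*)(&block2))[i]));
    // Do computations here
}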

Here I'm reading two blocks of 8 rows each, as uint. Reinterpreting each uint as uchar4 gives two 8x4 blocks, which together form an 8x8 block of uchar data.

The problem with this approach is that it leaves the work items holding data in row-major order. A computation like m[0] = x0 + x4 is then not possible, because x0 and x4 end up in different work items. The only alternative I can think of is to store the data in column-major order across the work items, so that instead of horizontal threads I have vertical threads. But I can't figure out how to do that.
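
One way to check which rows and columns each work item actually holds is to model the block read on the host. The sketch below follows my reading of the image form of the block read in the cl_intel_subgroups specification (byte coordinate, one uint per work item per row); treat that reading as an assumption to verify on the target device. The subgroup size of 8, the `pitch` parameter, and the names `model_block_read8` and `lane_byte` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SG_SIZE 8 /* assumed subgroup size */
#define ROWS 8    /* the "8" in intel_sub_group_block_read8 */

/* Host-side model of one image block read (assumed semantics):
 * lane l, component r receives the 4 bytes at row (y + r),
 * byte columns x + 4*l .. x + 4*l + 3. */
void model_block_read8(const uint8_t *img, size_t pitch, size_t x, size_t y,
                       uint32_t out[SG_SIZE][ROWS])
{
    for (size_t lane = 0; lane < SG_SIZE; ++lane)
        for (size_t r = 0; r < ROWS; ++r)
            memcpy(&out[lane][r], img + (y + r) * pitch + x + 4 * lane,
                   sizeof(uint32_t));
}

/* Byte b (0..3) of component r held by one lane, in memory order,
 * mirroring what as_uchar4() exposes in the kernel. */
uint8_t lane_byte(uint32_t block[SG_SIZE][ROWS], size_t lane, size_t r, size_t b)
{
    uint8_t bytes[4];
    memcpy(bytes, &block[lane][r], sizeof bytes);
    return bytes[b];
}
```

In this model, component r of a lane always comes from row r, and the second read at x + 4 gives lane 0 the same bytes that lane 1 received from the first read. Dumping `get_sub_group_local_id()` together with the loaded values into a debug buffer on the real device would confirm or refute the model before any data is moved between work items.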

Can you shuffle elements into the correct lane ("thread") with `intel_sub_group_shuffle*`? — Tim, Mar 27 '17 at 17:34
I don't think I fully understand `intel_sub_group_shuffle`. As far as I understand, shuffle copies the same data to all the threads. So in order to use shuffle in my case, I'd have to do scalar loads instead of vector loads, right? But is that more efficient than vector operations? — Harsh Wardhan, Mar 28 '17 at 12:47
Not an expert on SGs myself, but I believe the subgroup id passed to the shuffle doesn't have to be uniform, i.e. you can rotate a vector value: `_shuffle(data, (id + k) % subgroup_size)`. See https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt — Tim, Mar 30 '17 at 16:39
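
The rotation idea from the last comment can be tried out on the host before touching the kernel. The sketch below is a plain C model, not kernel code: `model_shuffle` stands in for a subgroup shuffle with a per-lane (non-uniform) source index, and `transpose_by_rotation` uses `(id + k) % subgroup_size` as that index. Both function names and the subgroup size of 8 are hypothetical.

```c
#include <stdint.h>

#define SG_SIZE 8 /* assumed subgroup size */

/* Model of a subgroup shuffle: each lane reads the value currently held
 * by the lane it names in src_lane[]; the index may differ per lane. */
static void model_shuffle(const int16_t value[SG_SIZE],
                          const int src_lane[SG_SIZE],
                          int16_t result[SG_SIZE])
{
    for (int lane = 0; lane < SG_SIZE; ++lane)
        result[lane] = value[src_lane[lane]];
}

/* in[lane][c] is component c held by work item `lane`.
 * Afterwards out[lane][c] == in[c][lane], i.e. the data is transposed
 * across the subgroup using SG_SIZE rotated shuffles. */
void transpose_by_rotation(int16_t in[SG_SIZE][SG_SIZE],
                           int16_t out[SG_SIZE][SG_SIZE])
{
    for (int k = 0; k < SG_SIZE; ++k) {
        int16_t offered[SG_SIZE], got[SG_SIZE];
        int src[SG_SIZE];
        for (int lane = 0; lane < SG_SIZE; ++lane) {
            /* Each lane offers the component its reader in this step needs. */
            offered[lane] = in[lane][(lane - k + SG_SIZE) % SG_SIZE];
            /* Each lane reads from the lane k positions ahead, wrapping around. */
            src[lane] = (lane + k) % SG_SIZE;
        }
        model_shuffle(offered, src, got);
        for (int lane = 0; lane < SG_SIZE; ++lane)
            out[lane][src[lane]] = got[lane]; /* value came from lane src[lane] */
    }
}
```

A kernel version of this would index the private data with a value derived from the lane id, which may or may not compile to efficient code on a given device, so the shuffle count alone does not say whether it beats the vector loads.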

