My vectorized OpenCL code looks like this:
short8 x0, x1, x2, x3, x4, x5, x6, x7, m[8];
x0 = convert_short8(vload8(0, Org + 0 * Stride));
x1 = convert_short8(vload8(0, Org + 1 * Stride));
x2 = convert_short8(vload8(0, Org + 2 * Stride));
x3 = convert_short8(vload8(0, Org + 3 * Stride));
x4 = convert_short8(vload8(0, Org + 4 * Stride));
x5 = convert_short8(vload8(0, Org + 5 * Stride));
x6 = convert_short8(vload8(0, Org + 6 * Stride));
x7 = convert_short8(vload8(0, Org + 7 * Stride));
m[0] = x0 + x4;
m[1] = x1 + x5;
m[2] = x2 + x6;
m[3] = x3 + x7;
m[4] = x0 - x4;
m[5] = x1 - x5;
m[6] = x2 - x6;
m[7] = x3 - x7;
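For reference, this first stage can also be written as a scalar loop. The following is only a minimal plain-C sketch of the same arithmetic, not the kernel itself: the function name `butterfly_stage1`, the 8-bit sample type, and the `m[row][column]` output layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the first butterfly stage on one 8x8 block.
 * org    : top-left sample of the block (assumed to be 8-bit data)
 * stride : distance in samples between consecutive rows (assumed >= 8)
 * m      : output, indexed as m[row][column]
 * Rows 0..3 of m hold the sums, rows 4..7 hold the differences. */
static void butterfly_stage1(const uint8_t *org, size_t stride, int16_t m[8][8])
{
    assert(stride >= 8);
    for (int r = 0; r < 4; r++) {
        const uint8_t *top = org + (size_t)r * stride;       /* row r     */
        const uint8_t *bot = org + (size_t)(r + 4) * stride; /* row r + 4 */
        for (int c = 0; c < 8; c++) {
            m[r][c]     = (int16_t)(top[c] + bot[c]); /* m[r]     = x_r + x_(r+4) */
            m[r + 4][c] = (int16_t)(top[c] - bot[c]); /* m[r + 4] = x_r - x_(r+4) */
        }
    }
}
```

Each `short8` addition in the kernel corresponds to one pass of the inner loop over the 8 columns, so every output row needs two input rows that are 4 rows apart.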
Now I'm trying to rewrite the above logic using the Intel OpenCL subgroup extensions with block reads:
int8 iO;
uint8 block1,block2;
int2 coordA;
coordA = (int2)(0, 0);
block1 = intel_sub_group_block_read8(Org, coordA);
coordA.x += 4;
block2 = intel_sub_group_block_read8(Org, coordA);
for (int i = 0 ; i < 8; i++)
{
iO.lo = convert_int4(as_uchar4(((uint*)(&block1))[i]));
iO.hi = convert_int4(as_uchar4(((uint*)(&block2))[i]));
// Do computations here
}
Here I'm reading two blocks of 8 rows each, of type uint. Reinterpreting each uint as uchar4 gives me two 8x4 blocks, which together form an 8x8 block of uchar data. The problem with this approach is that the data ends up distributed across the work items in row-major order. So a computation like m[0] = x0 + x4 is not possible, because x0 and x4 will be in different work items. The only other way I can think of is to store the data in column-major order across the work items, so that instead of horizontal threads I have vertical threads. But I can't figure out how to do that.
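To make the unpacking step concrete, here is a plain-C emulation of what the loop body does for a single row. It is a sketch under assumptions, not device code: it assumes little-endian byte order, so that the lowest byte of each uint becomes the first uchar, and the helper names `unpack_uint_to_int4` and `unpack_row` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates convert_int4(as_uchar4(word)), assuming little-endian order:
 * the lowest byte of the word becomes component 0. */
static void unpack_uint_to_int4(uint32_t word, int out[4])
{
    assert(out != NULL);
    for (int k = 0; k < 4; k++)
        out[k] = (int)((word >> (8 * k)) & 0xFFu);
}

/* Builds one 8-sample row the way the loop body does for index i:
 * the low half comes from block1[i], the high half from block2[i]. */
static void unpack_row(uint32_t block1_i, uint32_t block2_i, int row[8])
{
    unpack_uint_to_int4(block1_i, row);     /* corresponds to iO.lo */
    unpack_uint_to_int4(block2_i, row + 4); /* corresponds to iO.hi */
}
```

For example, passing `0x04030201` and `0x08070605` to `unpack_row` fills `row` with the values 1 through 8, which is the row that one iteration of the loop sees in `iO`.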