OpenCL - Most efficient way to split byte into an 8-component-vector

Question

I'm building a simulation of the Ising Model in OpenCL which means that my data consists of a bunch of states which can either be up/1 or down/-1.

To save memory bandwidth, 8 of these states are encoded into a single byte (up=1, down=0). In one of the calculations I need an integer vector whose values correspond to the original states, i.e. 1 or -1.

Example:
Input byte (uchar in OpenCL): 01010011
Convert to: (int8)(-1,1,-1,1,-1,-1,1,1);

I do have a working solution for that problem, but I'm wondering if there is a quicker, more efficient way:

uchar c = spins[id];
int8 spin;
spin.s0 = (c >> 0) & 1;
spin.s1 = (c >> 1) & 1;
spin.s2 = (c >> 2) & 1;
spin.s3 = (c >> 3) & 1;
spin.s4 = (c >> 4) & 1;
spin.s5 = (c >> 5) & 1;
spin.s6 = (c >> 6) & 1;
spin.s7 = (c >> 7) & 1;
spin = spin * 2 - 1;
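
For checking kernel results on the host, the shift-and-mask logic above can be mirrored in plain C. This is only a sketch: `unpack_spins` is a hypothetical helper name, and `out[i]` stands in for `spin.s<i>`. Note that with these shifts component 0 receives the least significant bit, so the byte 01010011 comes out as (1,1,-1,-1,1,-1,1,-1), which is the example listing above read from right to left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain C mirror of the kernel snippet (hypothetical helper, not part of
   the original code): out[i] corresponds to spin.s<i>. */
void unpack_spins(uint8_t c, int out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; ++i) {
        int bit = (c >> i) & 1;   /* 0 or 1; i == 0 is the least significant bit */
        out[i] = bit * 2 - 1;     /* map 0 -> -1 and 1 -> +1 */
    }
}
```

Calling `unpack_spins(0x53, s)` fills `s` with the values a kernel should produce for the byte 01010011, which can then be compared against the data read back from the device.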

EDIT:

Does not seem to be faster in my situation, but it's more concise at least:

__constant uchar8 bits = (uchar8)(0,1,2,3,4,5,6,7);

uchar c = spins[id];
int8 spin = convert_int8((uchar8)(c) >> bits & 1) * 2 - 1;

This already seems like quite a neat solution. Why go for something more complex? `int8 spin = ((int8)(c) >> (int8)(0,1,2,3,4,5,6,7) & 1) * 2 - 1;` — DarkZeros, Mar 24 '16 at 11:26

huseyin tugrul buyukisik · Accepted Answer · 2016-03-23T22:40:22.243

It seems bool8 is still a reserved type. I thought it would be available to users by now, but I was wrong.

Option 1)

This is not safe, and I'm not 100% sure it works on all hardware, but you can define this union:

            typedef union hardwareBool8{
                char  v;
                bool bit_select[8];
            } vecb8;

then in a kernel:

            vecb8 t = {5};                        // initialize from your uchar/char value
            t.v = 1;                              // or assign the value afterwards
            t.bit_select[4] = 0;                  // set (or read) a single element
            int intVariable  = t.bit_select[7];   // may be 1, 0 or -1; you should test this. If it is not -1, you can negate it:
            int intVariable2 = -t.bit_select[7];

This compiles on my AMD machine, but I'm not sure about other hardware. Even endianness could be a problem.
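
One way to check whether such a union can address individual bits is to look at its layout. The sketch below does this in host-side C, under the assumption that the device compiler treats `bool` similarly (it may not); `vecb8_size` and `bit_select_offset` are hypothetical helper names. In C, every `bool` occupies at least one byte, so the array spans at least 8 bytes and only element 0 overlaps the byte `v`.

```c
#include <assert.h>   /* only needed if the results are checked with assert() */
#include <stdbool.h>
#include <stddef.h>

typedef union hardwareBool8 {
    char v;
    bool bit_select[8];
} vecb8;

/* Bytes occupied by the union; true per-bit access would require 1. */
size_t vecb8_size(void)
{
    return sizeof(vecb8);
}

/* Byte offset of bit_select[i] from the start of the union (where v lives). */
size_t bit_select_offset(size_t i)
{
    return offsetof(vecb8, bit_select) + i * sizeof(bool);
}
```

If `vecb8_size()` reports 8 or more, as it does in standard C, then `bit_select[1]` through `bit_select[7]` refer to bytes beyond `v` rather than to bits inside it.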

Option 2)

Maybe broadcast the same char to 8 threads (or access the same location from 8 threads):

   char charVar= ... load from same address/index ;

then work on a different bit index in each thread:

  spin.s0 = (c >> 0) & 1; (on thread 0)

...

  spin.s7 = (c >> 7) & 1; (on thread 7)

This should improve performance, but each thread then produces only a single spin element. Many current GPU architectures support broadcasting the same data to all threads in a single instruction. If your device is a CPU, 8 threads per work-group shouldn't slow things down much, but on a GPU, selecting 1 char per 8 consecutive threads is tricky. Something like:

  charArrayIndex = globalThreadId / 8;
  c = charArray[charArrayIndex];

  // assuming spin is a local memory array shared by the work-group's threads
  spin[globalThreadId % 8] = (c >> (globalThreadId % 8)) & 1;

If spin has to be a private variable, you can use the same local memory array as a communication buffer to copy the values into each thread's private variables. This goes from (instruction-level + thread-level) parallelism to thread-level parallelism only.
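
The index arithmetic of Option 2 can be sketched in host-side C as the work one thread would do. This is an illustration only: `spin_for_work_item` and `packed` are hypothetical names, `global_id` stands in for the value a kernel would obtain from `get_global_id(0)`, and the result is mapped to +1/-1 as the question asks rather than left as 0/1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a single work-item would compute under Option 2 (hypothetical helper):
   8 consecutive work-items read the same byte and each extracts one bit. */
int spin_for_work_item(const uint8_t *packed, size_t global_id)
{
    assert(packed != NULL);
    size_t   byte_index = global_id / 8;              /* shared by 8 work-items */
    unsigned bit_index  = (unsigned)(global_id % 8);  /* this work-item's bit */
    int bit = (packed[byte_index] >> bit_index) & 1;
    return bit * 2 - 1;                               /* 0 -> -1, 1 -> +1 */
}
```

Looping `global_id` over `0 .. 8 * n - 1` on the host reproduces what `8 * n` work-items would produce for `n` packed bytes.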

Option 3)

You can distribute the bit selection (all 8 of them) across different "units" of a core; if the operations are executed in different units, this may benefit from out-of-order execution.

spin.s2 = (c / 4) & 1;   // 1 division and 1 logical operation
spin.s0 = c & 1;         // 1 logical operation
spin.s1 = (c & 2) > 0;   // 1 logical operation and 1 comparison

It's like computing one spin element in an expensive but independent way: while that heavy work is in flight, the other elements are computed through instruction-level parallelism. Also, the last element doesn't need ANDing with 1, because only a single bit is left on the right after the shift. You save another instruction this way. — huseyin tugrul buyukisik, Mar 24 '16 at 12:00

I don't think using a union works in this case. It does compile on my machine (AMD too), but it yields strange results. I don't think the bits of a byte can be addressed this way, as bools are probably not just a single bit wide. — Gigo, Mar 24 '16 at 14:23
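
The alternative bit tests from Option 3, together with the remark that the last element needs no mask, can be sanity-checked on the host against the plain shift-and-mask form. The sketch below assumes an unsigned byte, as with `uchar` in the question; `option3_forms_agree` is a hypothetical name. It says nothing about whether the variants are faster on a given device, only that they yield the same 0/1 values.

```c
#include <assert.h>   /* only needed if the result is checked with assert() */
#include <stdint.h>

/* Returns 1 if the Option 3 variants match (c >> k) & 1 for all 256 byte values. */
int option3_forms_agree(void)
{
    for (unsigned v = 0; v < 256; ++v) {
        uint8_t c = (uint8_t)v;
        int s0 = c & 1;          /* mask only */
        int s1 = (c & 2) > 0;    /* mask, then comparison */
        int s2 = (c / 4) & 1;    /* division instead of a shift */
        int s7 = c >> 7;         /* no mask: one bit remains for an unsigned byte */
        if (s0 != ((c >> 0) & 1)) return 0;
        if (s1 != ((c >> 1) & 1)) return 0;
        if (s2 != ((c >> 2) & 1)) return 0;
        if (s7 != ((c >> 7) & 1)) return 0;
    }
    return 1;
}
```

With a signed `char` the unmasked `c >> 7` would not be safe to rely on, since the sign bit can be extended by the shift.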
