Is there any way to de-interleave 32 bpp image channels with SSE, similar to the NEON code below?

// Read 8 RGBA pixels, de-interleaved into 4 registers (R, G, B, A)
uint8x8x4_t SrcPixels8x8x4 = vld4_u8(inPixel32);

// Widen the 8 R values to 16 bits, then widen each half to 4 x 32 bits
uint32x4_t ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
uint32x4_t ChannelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0])));

Basically, I want each color channel in its own vectors, with every vector holding 4 elements of 32 bits, so that I can do some calculations on them. I am not very familiar with SSE and could not find an equivalent instruction. Can someone suggest a way to do this, or a better approach? Any help is highly appreciated.

Bharat Ahuja
  • Are your pixels 4 x 8 bits RGBA, i.e. 32 bits for each pixel ? And what output format do you want, separate vectors of 8 bit R, G, B, and A, or do you also want to unpack the 8 bit values to 32 bits at the same time ? – Paul R Mar 08 '16 at 16:26
  • Yes, my image is 32 bpp (8 bits per channel), and yes, I also want to unpack the 8-bit values to 32 bits at the same time: something like R, R, R, R (where each R takes 32 bits), and similarly B, B, B, B. This helps when I multiply each R, G, B, A by some 32-bit value. – Bharat Ahuja Mar 08 '16 at 16:30
  • I am just trying to implement a Gaussian blur where my Gaussian coefficients are 32 bits, so I need this de-interleaving; then I can simply multiply the de-interleaved vectors by the Gauss vector. – Bharat Ahuja Mar 08 '16 at 16:31
  • You probably don't want to do 32x32 bit multiplies for a filter such as this, especially if it's performance-critical. Use fixed point 16x16 multiplies. – Paul R Mar 08 '16 at 16:38
  • Can you please show how to do a 16x16 multiply and store the result in 32 bits? Actually, my Gaussian coefficients are 16-bit values. – Bharat Ahuja Mar 08 '16 at 16:47
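
The thread does not show the 16x16 multiply that was asked about, so the following is only a minimal sketch of one common SSE2 approach, not the commenter's own code. It assumes both the pixel values and the coefficients are unsigned 16-bit; the function name `mul_u16_widen` is hypothetical. It computes the low and high halves of each product separately and interleaves them into full 32-bit results.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Multiply 8 unsigned 16-bit lanes of x by the matching lanes of c and
   return the full 32-bit products: lanes 0..3 in *lo, lanes 4..7 in *hi.
   Assumption: both inputs are unsigned 16-bit values. */
static void mul_u16_widen(__m128i x, __m128i c, __m128i *lo, __m128i *hi)
{
    __m128i pl = _mm_mullo_epi16(x, c);   /* low 16 bits of each product  */
    __m128i ph = _mm_mulhi_epu16(x, c);   /* high 16 bits of each product */
    *lo = _mm_unpacklo_epi16(pl, ph);     /* (high << 16) | low, lanes 0..3 */
    *hi = _mm_unpackhi_epi16(pl, ph);     /* (high << 16) | low, lanes 4..7 */
}
```

For a filter that sums products of neighbouring taps, `_mm_madd_epi16` (signed 16x16 multiply with pairwise 32-bit accumulation) may be a better fit, provided the values fit in signed 16 bits.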

1 Answer


Since the 8 bit values are unsigned you can just do this with shifting and masking, much like you would for scalar code, e.g.

__m128i vrgba; // 4 RGBA pixels, loaded elsewhere (e.g. with _mm_loadu_si128)

__m128i vr = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
__m128i vg = _mm_and_si128(_mm_srli_epi32(vrgba, 8), _mm_set1_epi32(0xff));
__m128i vb = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
__m128i va = _mm_srli_epi32(vrgba, 24);

Note that I'm assuming each 32-bit RGBA element has the R component in the least significant 8 bits and the A component in the most significant 8 bits. If your layout is the reverse, you can just swap the names of the vr/vg/vb/va vectors accordingly.
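
As a self-contained illustration of the shift-and-mask approach, here is a minimal sketch that wraps it in a function together with an unaligned load. The function name `deinterleave_rgba_u32` and the `src` parameter are hypothetical, and it assumes a little-endian x86 target with R stored as the first byte of each pixel.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* De-interleave 4 RGBA pixels (16 bytes at src) into four vectors,
   each holding 4 zero-extended 32-bit channel values.
   Assumption: R is the first byte of each pixel in memory. */
static void deinterleave_rgba_u32(const uint8_t *src,
                                  __m128i *r, __m128i *g,
                                  __m128i *b, __m128i *a)
{
    const __m128i mask = _mm_set1_epi32(0xff);
    __m128i vrgba = _mm_loadu_si128((const __m128i *)src);

    *r = _mm_and_si128(vrgba, mask);                       /* bits  0..7  */
    *g = _mm_and_si128(_mm_srli_epi32(vrgba, 8), mask);    /* bits  8..15 */
    *b = _mm_and_si128(_mm_srli_epi32(vrgba, 16), mask);   /* bits 16..23 */
    *a = _mm_srli_epi32(vrgba, 24);                        /* bits 24..31 */
}
```

The A channel needs no mask because the logical right shift already fills the upper bits with zeros.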

Paul R
  • Thanks Paul, that worked. I am wondering how we can modify this to unpack to 16 bits rather than 32. – Bharat Ahuja Mar 08 '16 at 17:04
  • You can do something very similar for 16 bits - obviously you would need to start with two input vectors though. If you need further help with this then post a new question asking for a 16 bit solution and tag it `simd` and I'll come up with something (unless someone else beats me to it). – Paul R Mar 08 '16 at 17:16
  • Sure, I was able to do this for 32 bits. I will look at the performance first, and if needed I will post another question with the actual algorithm I am trying to implement. Thank you so much for your kind help. I was stuck on this for 2 days, since it is very hard to find good documentation about it. – Bharat Ahuja Mar 08 '16 at 17:28
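
The 16-bit variant was never posted in this thread, so the following is only a sketch of what "something very similar with two input vectors" could look like; it is not the answerer's code. The function name `deinterleave_rgba_u16` is hypothetical, and the same byte order is assumed (R first in memory). Each channel is extracted as 32-bit lanes from two loads, then the two results are packed into one vector of eight 16-bit values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* De-interleave 8 RGBA pixels (32 bytes at src) into four vectors,
   each holding 8 zero-extended 16-bit channel values.
   Assumption: R is the first byte of each pixel in memory. */
static void deinterleave_rgba_u16(const uint8_t *src,
                                  __m128i *r, __m128i *g,
                                  __m128i *b, __m128i *a)
{
    const __m128i mask = _mm_set1_epi32(0xff);
    __m128i p0 = _mm_loadu_si128((const __m128i *)src);        /* pixels 0..3 */
    __m128i p1 = _mm_loadu_si128((const __m128i *)(src + 16)); /* pixels 4..7 */

    /* Values are 0..255, so the signed saturating pack never clips. */
    *r = _mm_packs_epi32(_mm_and_si128(p0, mask),
                         _mm_and_si128(p1, mask));
    *g = _mm_packs_epi32(_mm_and_si128(_mm_srli_epi32(p0, 8), mask),
                         _mm_and_si128(_mm_srli_epi32(p1, 8), mask));
    *b = _mm_packs_epi32(_mm_and_si128(_mm_srli_epi32(p0, 16), mask),
                         _mm_and_si128(_mm_srli_epi32(p1, 16), mask));
    *a = _mm_packs_epi32(_mm_srli_epi32(p0, 24),
                         _mm_srli_epi32(p1, 24));
}
```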