Summing 3 vectors and get the result in neon

Question

I'm trying to sum d0,d1,d2 + d3,d4,d5+ d6,d7,d8. I don't know the best instruction for that and then take the average by 9. I know how to do the averaging using approximation, but summing those lanes, I can't find an instruction for that ? I also have incorrect output image, so I suspect of the averaging operation if it is correct or not.

inline void downsample3dOnePass( uint8_t* src, uint8_t *dst, int srcWidth)
{

    for (int r = 0; r < (int)srcWidth/3; r++)
    {
       // load 24 pixels (grayscale)
       uint8x8x3_t r0       = vld3_u8(src);
       // move to next 24 byes
       src+=24;
       uint8x8x3_t r1       = vld3_u8(src);
       src+=24;
       uint8x8x3_t r2       = vld3_u8(src);

       uint16x8_t  d0  = vmovl_u8(r0.val[0]);
       uint16x8_t  d1  = vmovl_u8(r0.val[1]);
       uint16x8_t  d2  = vmovl_u8(r0.val[2]);

       uint16x8_t  d3  = vmovl_u8(r1.val[0]);
       uint16x8_t  d4  = vmovl_u8(r1.val[1]);
       uint16x8_t  d5  = vmovl_u8(r1.val[2]);

       uint16x8_t  d6  = vmovl_u8(r2.val[0]);
       uint16x8_t  d7  = vmovl_u8(r2.val[1]);
       uint16x8_t  d8  = vmovl_u8(r2.val[2]);

       uint16x8_t d0d3Sum      = vaddq_u16 ( d0, d3);
       uint16x8_t d0d3d6Sum    = vaddq_u16 ( d0d3Sum,  d6 );

       uint16x8_t d1d4Sum      = vaddq_u16 ( d1, d4);
       uint16x8_t d1d4d7Sum    = vaddq_u16 ( d1d4Sum, d7);

       uint16x8_t d2d5Sum      = vaddq_u16 ( d2, d5 );
       uint16x8_t d2d5d8Sum    = vaddq_u16 ( d2d5Sum, d8);

       uint16x8_t firstSum     = vaddq_u16(d0d3d6Sum, d1d4d7Sum);
       uint16x8_t secondSum    = vaddq_u16(firstSum, d2d5d8Sum);
       uint16x8_t totalSum     = vaddq_u16 ( firstSum, secondSum);

       // average = r0+r1+r2/8 ~9 for test
       uint16x8_t totalAverage = vshrq_n_u16(totalSum,3);
       uint8x8_t  finalValue   = vmovn_u16(totalAverage);
       // store 8 bytes
       vst1_u8(dst, finalValue);

       src+=24;
       // move to next row
       dst+=8;

   }

}

void downsample3d( uint8_t* src, uint8_t *dest, int srcWidth, int srcHeight )
{
    for (int r = 0; r < (int)srcHeight/3; r++)
    {
         downsample3dOnePass(src, dest, srcWidth);
    }
}

UPDATE: According to bitbank answer:

    inline void downsample3dOnePass( uint8_t* src, uint8_t *dst, int srcWidth, int srcHeight, int strideSrc, int strideDest)
    {
        int iDestPitch = (strideDest);
        uint8_t *s, *d;
        uint8x8x3_t u88line0;
        uint8x8x3_t u88line1;
        uint8x8x3_t u88line2;
        uint8x8_t   u88Final;
        uint16x8_t  u168Sum;
        int16x8_t   i168divisor = vdupq_n_s16(7282/2); // 65536/9 - used with doubling saturating return high multiply

        for (int r = 0; r < srcHeight/3; r++)
        {
            d = &dst[iDestPitch * r];
            s = &src[srcWidth * r*3];

            for (int c = 0; c < srcWidth/3; c+=8)
            {
                // load 8 sets of 3x3 pixels (grayscale)
                u88line0 = vld3_u8(&s[0]);
                u88line1 = vld3_u8(&s[srcWidth]);
                u88line2 = vld3_u8(&s[srcWidth*2]);
                s += 24;
                // Sum vertically
                u168Sum = vaddl_u8(u88line0.val[0], u88line0.val[1]); // add with widening
                u168Sum = vaddw_u8(u168Sum, u88line0.val[2]); // accumulate with widening (horizontally)
                u168Sum = vaddw_u8(u168Sum, u88line1.val[0]); // add the other vectors together
                u168Sum = vaddw_u8(u168Sum, u88line1.val[1]);
                u168Sum = vaddw_u8(u168Sum, u88line1.val[2]);
                u168Sum = vaddw_u8(u168Sum, u88line2.val[0]);
                u168Sum = vaddw_u8(u168Sum, u88line2.val[1]);
                u168Sum = vaddw_u8(u168Sum, u88line2.val[2]);
                // we now have the 8 sets of 3x3 pixels summed to 8 16-bit values
                // To divide by 9 we will instead multiply by the inverse (65536/9) = 7282
                u168Sum = vreinterpretq_u16_s16(vqrdmulhq_s16(i168divisor, vreinterpretq_s16_u16(u168Sum)));
                u88Final = vmovn_u16(u168Sum); // narrow to 8 bits
                // store 8 bytes
                vst1_u8(d, u88Final);
                d += 8;
            } // for column
        } // for row
    }


usage: 
//1280*920*grayscale
QImage normalImage("/data/normal_image.png");

uint8_t *resultImage = new uint8_t[440*306];
  downsample3dOnePass(normalImage.bits(),resultImage, normalImage.width(), normalImage.height(), 1280, 440);

You're adding multiple bytes together and storing them in a byte, then taking the average of that. Picture, as an example, what would happen if the corresponding bytes in two vectors were 0xff and 0x01 and you added those together as bytes. You'll either have to expand all the pixels to 16-bit values while summing them, or right-shift prior to adding (avoid the latter method if you can, since it will result in unnecessary loss of precision). — Michael, Mar 20 '13 at 14:30
@Michael I'm looking for an intrinsic that converts uint8x8 to uint16x8, but can't find it. +1 for notice :) — andre_lamothe, Mar 20 '13 at 14:40
`VMOVL (Vector Move Long) takes each element in a doubleword vector, sign or zero extends them to twice their original length, and places the results in a quadword vector.`. So the intrinsic you want is `uint16x8_t vmovl_u8 (uint8x8_t)` — Michael, Mar 20 '13 at 14:49

score 3 · Accepted Answer · answered Mar 20 '13 at 16:06

3

There are several problems with your code. NEON intrinsics are pretty bad when it comes to the VLDx handling, but your big mistakes are that you're overflowing your byte values and loading pixels horizontally instead of vertically. Here's a better algorithm which will process 8*3x3 source pixels into 8 destination pixels at a time. Your function is also missing the rows parameter.

inline void downsample3dOnePass( uint8_t* src, uint8_t *dst, int srcWidth, int srcHeight)
{
int iDestPitch = ((srcWidth/3)+3) & 0xfffffffc; // DWORD aligned
uint8_t *s, *d;
uint8x8x3_t u88line0, u88line, u88line2;
uint8x8_t u88Final;
uint16x8_t u168Sum;
int16x8_t i168divisor = vdupq_n_s16(7282/2); // 65536/9 - used with doubling saturating return high multiply

  for (int r = 0; r < srcHeight/3; r++)
    {
    d = &dst[iDestPitch * r];
    s = &src[srcWidth * r*3];

    for (int c = 0; c < srcWidth/3; c+=8)
    {
       // load 8 sets of 3x3 pixels (grayscale)
       u88line0 = vld3_u8(&s[0]);
       u88line1 = vld3_u8(&s[srcWidth]);
       u88line2 = vld3_u8(&s[srcWidth*2]);
       s += 24;
       // Sum vertically
       u168Sum = vaddl_u8(u88Line0.val[0], u88Line0.val[1]); // add with widening
       u168Sum = vaddw_u8(u168Sum, u88Line0.val[2]); // accumulate with widening (horizontally)
       u168Sum = vaddw_u8(u168Sum, u88Line1.val[0]); // add the other vectors together
       u168Sum = vaddw_u8(u168Sum, u88Line1.val[1]);
       u168Sum = vaddw_u8(u168Sum, u88Line1.val[2]);
       u168Sum = vaddw_u8(u168Sum, u88Line2.val[0]);
       u168Sum = vaddw_u8(u168Sum, u88Line2.val[1]);
       u168Sum = vaddw_u8(u168Sum, u88Line2.val[2]);
       // we now have the 8 sets of 3x3 pixels summed to 8 16-bit values   
       // To divide by 9 we will instead multiply by the inverse (65536/9) = 7282
       u168Sum = vreinterpretq_u16_s16(vqrdmulhq_s16(i168divisor, vreinterpretq_s16_u16(u168Sum)));
       u88Final = vmovn_u16(u168Sum); // narrow to 8 bits
       // store 8 bytes
       vst1_u8(d, u88Final);
       d += 8;    
   } // for column
} // for row

answered Mar 20 '13 at 16:06

BitBank

8,500
3
28
46

the image is distorted. here is how I call it uint8_t *resultImage = new uint8_t[450*310]; downsample3dOnePass(normalImage.bits(),resultImage, normalImage.width(), normalImage.height()); The normalImage is 1280*920 +1 – andre_lamothe Mar 20 '13 at 16:29
"distorted" is not very descriptive. Check that the pitch of the source and destination images is done correctly. In my code above, I use a dword-aligned destination pitch. You may not be assuming this value, so change it to srcWidth/3. From your original code, it appears that you don't have a solid grasp of how images are laid out in memory and that sounds like the current issue as well. – BitBank Mar 20 '13 at 16:44
sorry but I didn't get you. Should I change int iDestPitch = ((srcWidth/3)+3) & 0xfffffffc; to int iDestPitch = (srcWidth/3);The image is shown here http://i45.tinypic.com/2wnwqw9.png – andre_lamothe Mar 20 '13 at 16:51
1

That's my point. The pitch (bytes per line) of the image is not handled correctly. Your misunderstanding of this concept is one of the stumbling blocks preventing you from writing working code. PITCH or STRIDE is the number of bytes per row of the image regardless of how many pixels there are. The png you displayed has the pitch wrong which is why it has that diagonal line pattern. Regardless of what pitch you choose, whatever is reading/displaying the image must use the same value. PNG is byte-aligned, not dword-aligned. – BitBank Mar 20 '13 at 16:58
I assigned int iDestPitch = stride, and stride is the width of the dest image which is 360. The diagnoal image is changed, and now it has the effect of GL_Repeated on horizontal side... – andre_lamothe Mar 20 '13 at 17:10
That's a problem - you said the original image is 1280x920. 1280/3 = 426.6666. How did you get to 360? These comments are not the right place to explain how raster graphics work. You need to have a clearer understanding of memory layouts. If you start with a 1280x920 image, the code above will create a 426x306 image. How you handle it from there is up to you. – BitBank Mar 20 '13 at 17:20
using 426*306 as the dest image, it gives segmentation fault. – andre_lamothe Mar 20 '13 at 17:24
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26566/discussion-between-bitbank-and-ahmed-saleh) – BitBank Mar 20 '13 at 17:30
I fixed it thanks. I allocated uint8_t *resultImage = new uint8_t[864*312]; then the stride is 864 downsample3dOnePass(normalImage.bits(),resultImage, normalImage.width(), normalImage.height(), 1280, 864); then: QImage imgIn (resultImage, 430, 310,QImage::Format_Indexed8) iDestPitch = strideDest/2; // 864/2 I don't know why it's worked by on paper it should work with this maths – andre_lamothe Mar 20 '13 at 18:08

score 0 · Answer 2 · answered Mar 20 '13 at 15:00

0

In order to avoid overflow while adding the bytes of several vectors together you should expand from bytes to halfwords (16-bit) prior to the summation. Once you've summed up all the pixels and divided the result you can narrow the result back to bytes.

The NEON intrinsic to use for expanding bytes to halfwords in GCC is
uint16x8_t vmovl_u8 (uint8x8_t)

And the corresponding intrinsic for narrowing is
uint8x8_t vmovn_u16 (uint16x8_t)

Note that if you add 9 pixels and divide by 8 you might still risk overflow when narrowing back to bytes. In that case you could use vqmovn_u16 which behaves like vmovn_u16 but also performs saturation.

answered Mar 20 '13 at 15:00

Michael

57,169
9
80
125

I still have distorted image :/ I have updated the full code. – andre_lamothe Mar 20 '13 at 15:03
`uint16x8_t totalSum = vaddq_u16 ( firstSum, secondSum);` this line looks incorrect to me. `secondSum` should already contain the total sum of all pixels at this point, so doing another addition would just sum the pixels multiple times. – Michael Mar 20 '13 at 15:11
a mistake... I changed it but same no result :S – andre_lamothe Mar 20 '13 at 15:14
1

Then it's probably time to start dumping and analyzing the output. Create a small single-colored or chessboard pattern bitmap, run your algorithm on it and dump the result to stdout or a file and see if you can't spot the problem based on that. – Michael Mar 20 '13 at 15:17
do you think it is correct for iteration over 24, and looping thru the images is correct ? – andre_lamothe Mar 20 '13 at 15:19
1

It should be ok if you only want to average horizontally. But the way you add pixels looks a bit off. Keep in mind that `vld3` performs de-interleaving on the loaded data. So for example (unless I'm mistaken), your output pixel 0 will be the average of input pixels `0,24,48,1,25,49,2,26 and 50`, not `0,1,2,3,4,5,6,7 and 8` which I assume that you'd want to use. – Michael Mar 20 '13 at 15:51
so you mean the summation of d0,d1,d2..etc are wrong ? I'confused – andre_lamothe Mar 20 '13 at 16:03
1

Adding 24 to the source pointer does not move you vertically - see my answer for the correct solution. – BitBank Mar 20 '13 at 16:10

Summing 3 vectors and get the result in neon

2 Answers2