Algorithm for downsampling an image by 3 using NEON

Question

I would like to know whether it is possible to downsample an image by 3 using NEON vectors. I'm trying to work out an algorithm on paper, but it seems impossible: if you load, for example, 8 bytes, you cannot get complete 3x3 pixel blocks, so there won't be enough pixels to complete the downsampling operation. Based on the downsample-by-2 approach in "Explaining ARM Neon Image Sampling", I am thinking about loading 16 bytes and then 8 bytes from one row, combining them into a 32-byte vector, and then processing 24 bytes of that vector. Would that work?

Update: I have written some sample code based on the answer, but I get a segmentation fault in the vst1_u8 call...

inline void downsample3dOnePass( uint8_t* src, uint8_t *dst, int srcWidth)
{

    // make sure rows/cols dividable by 8
    int rows = ((srcWidth>>3)<<3);
    // 8 pixels per row
    rows=rows>>3;

    for (int r = 0; r < rows; r++)
    {
       // load 24 pixels (grayscale)
       uint8x8x3_t pixels     = vld3_u8(src);
       // first sum = d0 + d1
       uint8x8_t firstSum     = vadd_u8 ( pixels.val[0], pixels.val[1] );
       // second sum = d1+d2;
       uint8x8_t secondSum    = vadd_u8 ( firstSum,  pixels.val[2] );
       // total sum = d0+d1+d2
       uint8x8_t totalSum     = vadd_u8(secondSum, firstSum);
       // average = d0+d1+d2/8 ~9 for test
       uint8x8_t totalAverage = vshr_n_u8(totalSum,3);
       // store 8 bytes
       vst1_u8(dst, totalAverage);
       // move to next 3 rows
       src+=24;
       // move to next row
       dst+=8;

    }

}

I don't know what you're asking. The code in the link you provided is processing 8 *pixels* per row, not 8 bytes. — Carey Gregory, Mar 19 '13 at 18:28

FrankH. · Accepted Answer · 2013-03-21T10:03:19.113


For every scanline you process, you can use structure loads via vld3.8. If you have the starting addresses of the first, second and third line of pixels in r0..r2 then:

vld3.8 {d0,d1,d2}, [r0]
vld3.8 {d3,d4,d5}, [r1]
vld3.8 {d6,d7,d8}, [r2]

gives you

d0 has bytes [0,3,6,9,12,15,18,21] of the first line
d1 has bytes [1,4,7,10,13,16,19,22] of the first line
d2 has bytes [2,5,8,11,14,17,20,23] of the first line
same for d3..d5 for the 2nd line and d6..d8 for the third
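
To make the lane layout concrete, here is a minimal scalar sketch in plain C that mimics what a single vld3.8 load does to 24 consecutive bytes. The helper name deinterleave3 is made up for illustration and is not part of any NEON API.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for one "vld3.8 {d0,d1,d2}, [r0]":
   byte 3*i + k of the source ends up in element i of lane k. */
static void deinterleave3(const uint8_t *src, uint8_t lane[3][8])
{
    for (int i = 0; i < 8; i++) {
        for (int k = 0; k < 3; k++) {
            lane[k][i] = src[3 * i + k];
        }
    }
}
```

Feeding it the bytes 0..23 gives lane[0] = 0,3,...,21, lane[1] = 1,4,...,22 and lane[2] = 2,5,...,23, matching the list above. Output pixel i of a 3x3 average then needs element i of all three lanes from each of the three rows.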

Then average them all. You might want to widen to 16 bits first so that the sums do not overflow and you do not lose precision.

Edit: The whole thing looks roughly like this (with the divide-by-nine left out):

//
// load 3x8 bytes from three consecutive scanlines
//
uint8x8x3_t pixels[3] =
    { vld3_u8(src), vld3_u8(src + srcWidth), vld3_u8(src + 2*srcWidth) };

//
// expand them to 16bit so that the addition doesn't overflow
//
uint16x8_t wpix[9] =
    { vmovl_u8(pixels[0].val[0]),
      ...
      vmovl_u8(pixels[2].val[2]) };

//
// eight adds to sum the nine vectors. Don't always add to wpix[0] because of possible dependencies.
//
wpix[0] = vaddq_u16(wpix[0], wpix[1]);
wpix[2] = vaddq_u16(wpix[2], wpix[3]);
wpix[4] = vaddq_u16(wpix[4], wpix[5]);
wpix[6] = vaddq_u16(wpix[6], wpix[7]);
wpix[0] = vaddq_u16(wpix[0], wpix[8]);

wpix[1] = vaddq_u16(wpix[2], wpix[4]);
wpix[3] = vaddq_u16(wpix[6], wpix[0]);
wpix[0] = vaddq_u16(wpix[1], wpix[3]);

[ .. divide-by-nine magic (in 16bit, aka for uint16x8_t), in wpix[0] ... ]
//
// truncate to 8bit and store back
//
vst1_u8(dst, vmovn_u16(wpix[0]));
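
When checking a NEON version like the one above, a plain scalar reference to compare against can be handy. The following is only a sketch under assumptions: 8-bit grayscale input, non-overlapping 3x3 blocks, truncating division, and a hypothetical function name and parameter list (downsample3_ref, srcStride, srcHeight) that do not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: average each non-overlapping 3x3 block.
   srcStride is the distance in bytes between the starts of two rows.
   dst must have room for (srcWidth / 3) * (srcHeight / 3) bytes. */
static void downsample3_ref(const uint8_t *src, size_t srcStride,
                            size_t srcWidth, size_t srcHeight, uint8_t *dst)
{
    size_t dstWidth = srcWidth / 3;
    size_t dstHeight = srcHeight / 3;

    assert(srcStride >= srcWidth);

    for (size_t y = 0; y < dstHeight; y++) {
        for (size_t x = 0; x < dstWidth; x++) {
            unsigned sum = 0;               /* at most 9 * 255 = 2295 */
            for (size_t dy = 0; dy < 3; dy++) {
                for (size_t dx = 0; dx < 3; dx++) {
                    sum += src[(3 * y + dy) * srcStride + 3 * x + dx];
                }
            }
            dst[y * dstWidth + x] = (uint8_t)(sum / 9);
        }
    }
}
```

The largest possible sum, 2295, is the reason the vector version widens to 16 bits before adding.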

Good luck !

edited Mar 21 '13 at 10:03

answered Mar 19 '13 at 18:34

FrankH.


What is the instruction for dividing by 9 to get the average? After that, should I store the resulting average? – andre_lamothe Mar 19 '13 at 18:50

There's no single instruction for that. Add them up and approximate the division by 9: (d9 - (d9>>3) + (d9>>6)) >> 3 is already quite close. – Aki Suihkonen Mar 19 '13 at 21:42

There's this famous example chapter from Hacker's Delight on _division by constants_, http://www.hackersdelight.org/divcMore.pdf, which has an example of how to code `div9` using only constant shifts and adds. That can be done completely in NEON instructions if necessary. Aki's code is an approximation of that. – FrankH. Mar 20 '13 at 09:13
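
As a scalar illustration of the two ideas in these comments, here is a small sketch in plain C. The function names are hypothetical, and the second variant (multiplying by a fixed-point reciprocal) is a swapped-in alternative that the comments do not mention. Both assume the input is a sum of nine 8-bit pixels, so at most 9 * 255 = 2295.

```c
#include <stdint.h>

/* Shift-and-add approximation:
   x/9 ~= x * (1 - 1/8 + 1/64) / 8 = x * 57/512.
   For x <= 2295 the result is within 1 of x / 9. */
static uint16_t div9_approx(uint16_t x)
{
    return (uint16_t)((x - (x >> 3) + (x >> 6)) >> 3);
}

/* Fixed-point reciprocal: 7282 * 9 = 65538, so 7282 is the smallest
   integer >= 65536 / 9. The product needs 32 bits; the result equals
   the truncated x / 9 for every x < 32768. */
static uint16_t div9_mul(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 7282u) >> 16);
}
```

For example, 71 / 9 truncates to 7, while the shift-and-add form gives 8. Whether that one-off error matters depends on the image quality you need.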
@FrankH. I have updated my question to the new code, according to your answer – andre_lamothe Mar 20 '13 at 09:18
@Ahmed: A segmentation fault in your `vst` intrinsic means the target (`dst`) pointer is invalid. Can you run it inside a debugger and get the register state as well as the precise `PC` value / faulting instruction ? – FrankH. Mar 20 '13 at 09:39
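
One thing that may be worth checking here, assuming srcWidth is the number of source pixels in a row and dst was allocated for srcWidth / 3 output pixels: the posted loop runs srcWidth / 8 times, and each iteration consumes 24 source bytes and writes 8 output bytes. The hypothetical helpers below only do that bookkeeping; they are not taken from the posted code.

```c
#include <stddef.h>

/* Bytes touched by the posted loop: (srcWidth / 8) iterations,
   24 bytes read and 8 bytes written per iteration. */
static size_t postedLoopBytesRead(size_t srcWidth)
{
    return (srcWidth >> 3) * 24;
}

static size_t postedLoopBytesWritten(size_t srcWidth)
{
    return (srcWidth >> 3) * 8;
}

/* Iterations that stay inside one row: one per full 24-byte chunk. */
static size_t chunksPerRow(size_t srcWidth)
{
    return srcWidth / 24;
}

/* Output pixels left over for a scalar tail loop after the full chunks. */
static size_t tailPixels(size_t srcWidth)
{
    return (srcWidth % 24) / 3;
}
```

For srcWidth = 240, the posted loop would read 720 bytes and write 240, whereas a 240-pixel row only yields 80 output pixels (10 chunks of 8). Under the stated assumption, that would run past the end of both buffers.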
@FrankH. I got it, I think. I load d0,d1,d2, then I sum them and average them; however, I have to load d0,d1,d2, then d3,d4,d5, then d6,d7,d8, then sum them all and then average... correct? – andre_lamothe Mar 20 '13 at 09:42
@FrankH +1 for illustration – andre_lamothe Mar 21 '13 at 12:43
