I would like to know whether it is possible to downsample an image by 3 with NEON vectors. I'm trying to work out an algorithm for that on paper, but it does not seem possible: when you load, for example, 8 bytes, you cannot get complete 3x3 pixel blocks, so there won't be enough pixels to complete the downsampling operation. Following the downsample-by-2 approach in "Explaining ARM Neon Image Sampling", could I load 16 bytes and then 8 bytes from one row, assign them to a 32-byte vector, and then process 24 bytes of that vector?

Update: I have written some sample code based on the answer, but I get a segmentation fault in the vst1_u8 call...

inline void downsample3dOnePass( uint8_t* src, uint8_t *dst, int srcWidth)
{

    // make sure rows/cols dividable by 8
    int rows = ((srcWidth>>3)<<3);
    // 8 pixels per row
    rows=rows>>3;

    for (int r = 0; r < rows; r++)
    {
       // load 24 pixels (grayscale)
       uint8x8x3_t pixels     = vld3_u8(src);
       // first sum = d0 + d1
       uint8x8_t firstSum     = vadd_u8 ( pixels.val[0], pixels.val[1] );
       // second sum = d1+d2;
       uint8x8_t secondSum    = vadd_u8 ( firstSum,  pixels.val[2] );
       // total sum = d0+d1+d2
       uint8x8_t totalSum     = vadd_u8(secondSum, firstSum);
       // average = d0+d1+d2/8 ~9 for test
       uint8x8_t totalAverage = vshr_n_u8(totalSum,3);
       // store 8 bytes
       vst1_u8(dst, totalAverage);
       // move to next 3 rows
       src+=24;
       // move to next row
       dst+=8;

    }

}
andre_lamothe

1 Answer

For every scanline you process, you can use structure loads via vld3.8. If you have the starting addresses of the first, second and third line of pixels in r0..r2 then:

vld3.8 {d0,d1,d2}, [r0]
vld3.8 {d3,d4,d5}, [r1]
vld3.8 {d6,d7,d8}, [r2]

gives you

  • d0 has bytes [0,3,6,9,12,15,18,21] of the first line
  • d1 has bytes [1,4,7,10,13,16,19,22] of the first line
  • d2 has bytes [2,5,8,11,14,17,20,23] of the first line
  • same for d3..d5 for the 2nd line and d6..d8 for the third

Then average them all. You might want to widen to 16 bits first so that you don't lose precision: the sum of nine bytes does not fit in 8 bits.
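To make the lane layout above concrete, here is a plain C model of what one three-way structure load does. It is only a scalar sketch for checking indices on paper or on a desktop machine, not NEON code; the function name `deinterleave3_u8` and the `lane` array are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 3-way de-interleaving load of 24 bytes.
 * lane[k][i] receives source byte 3*i + k, so lane[0], lane[1], lane[2]
 * play the role of d0, d1, d2 in the listing above.
 * (Hypothetical helper, for illustration only.) */
static void deinterleave3_u8(const uint8_t *src, uint8_t lane[3][8])
{
    for (size_t i = 0; i < 8; i++) {
        for (size_t k = 0; k < 3; k++) {
            lane[k][i] = src[3 * i + k];
        }
    }
}
```

With this layout, element `i` of all three lanes together holds the three horizontally adjacent pixels of output pixel `i`, which is why adding the lanes gives per-pixel row sums.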

Edit: The whole thing looks roughly like this (divide-by-nine left out):

//
// load 3x8 bytes from three consecutive scanlines
//
uint8x8x3_t pixels[3] =
    { vld3_u8(src), vld3_u8(src + srcWidth), vld3_u8(src + 2*srcWidth) };

//
// expand them to 16bit so that the addition doesn't overflow
//
uint16x8_t wpix[9] =
    { vmovl_u8(pixels[0].val[0]),
      ...
      vmovl_u8(pixels[2].val[2]) };

//
// eight adds to sum the nine vectors. Don't always add to wpix[0] because of possible dependencies.
//
wpix[0] = vaddq_u16(wpix[0], wpix[1]);
wpix[2] = vaddq_u16(wpix[2], wpix[3]);
wpix[4] = vaddq_u16(wpix[4], wpix[5]);
wpix[6] = vaddq_u16(wpix[6], wpix[7]);
wpix[0] = vaddq_u16(wpix[0], wpix[8]);

wpix[1] = vaddq_u16(wpix[2], wpix[4]);
wpix[3] = vaddq_u16(wpix[6], wpix[0]);
wpix[0] = vaddq_u16(wpix[1], wpix[3]);

[ .. divide-by-nine magic (in 16bit, aka for uint16x8_t), in wpix[0] ... ]
//
// truncate to 8bit and store back
//
vst1_u8(dst, vmovn_u16(wpix[0]));
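
When testing a vectorised version, it can help to have a scalar reference to compare against. The following is a minimal sketch of such a reference, assuming 8-bit grayscale pixels and a truncating average; the name `downsample3_row_ref` and the parameters `stride` and `outWidth` are assumptions for illustration and do not come from the code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: average each 3x3 block taken from three consecutive
 * source rows into one output pixel.
 *   src      - first of the three source rows
 *   stride   - distance in bytes between the starts of consecutive rows
 *   dst      - output row, outWidth bytes
 *   outWidth - number of output pixels (3 * outWidth source pixels are read
 *              from each of the three rows)
 * (Hypothetical helper, for illustration only.) */
static void downsample3_row_ref(const uint8_t *src, size_t stride,
                                uint8_t *dst, size_t outWidth)
{
    for (size_t x = 0; x < outWidth; x++) {
        uint16_t sum = 0; /* at most 9 * 255 = 2295, fits in 16 bits */
        for (size_t r = 0; r < 3; r++) {
            for (size_t c = 0; c < 3; c++) {
                sum = (uint16_t)(sum + src[r * stride + 3 * x + c]);
            }
        }
        dst[x] = (uint8_t)(sum / 9); /* truncating divide */
    }
}
```

Note that each output pixel consumes three source pixels per row, so a loop that produces 8 output bytes per iteration advances the source by 24 bytes and should run for roughly `srcWidth / 24` iterations.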

Good luck !

FrankH.
  • What is the instruction for averaging by 9? After that, should I store the resulting average? – andre_lamothe Mar 19 '13 at 18:50
  • There's no single instruction for that. Add them up and approximate the division by 9: `(d9 - (d9>>3) + (d9>>6)) >> 3` is already quite close. – Aki Suihkonen Mar 19 '13 at 21:42
  • There's a well-known sample chapter from Hacker's Delight on _division by constants_, http://www.hackersdelight.org/divcMore.pdf, which includes an example of how to code `div9` using only constant shifts and adds. That can be done completely in NEON instructions if necessary. Aki's code is an approximation of that. – FrankH. Mar 20 '13 at 09:13
  • @FrankH. I have updated my question to the new code, according to your answer – andre_lamothe Mar 20 '13 at 09:18
  • @Ahmed: A segmentation fault in your `vst` intrinsic means the target (`dst`) pointer is invalid. Can you run it inside a debugger and get the register state as well as the precise `PC` value / faulting instruction ? – FrankH. Mar 20 '13 at 09:39
  • @FrankH. I got it I think. I load d0,d1,d2, then I sum them and I average them, however I have to load d0,d1,d2 then d2,d3,d3, then d4,d5,d6 then sum all then average... correct? – andre_lamothe Mar 20 '13 at 09:42
  • @FrankH +1 for illustration – andre_lamothe Mar 21 '13 at 12:43
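
For the divide-by-nine step discussed in the comments, here is a scalar C sketch of two ways to do it on a 16-bit sum. The first uses the series 1/9 = (1/8)(1 - 1/8 + 1/64 - ...) truncated after three terms, which needs only shifts and adds. The second multiplies by the constant 7282 (the ceiling of 65536/9) and shifts right by 16, which is exact for inputs below 32768 and therefore for any sum of nine bytes. Both function names are made up for illustration, and neither is taken from the Hacker's Delight chapter.

```c
#include <stdint.h>

/* Shift-and-add approximation: x/9 ~= (x - x/8 + x/64) / 8.
 * For sums of nine bytes (x <= 2295) the result is within 1 of x / 9.
 * (Hypothetical helper, for illustration only.) */
static uint16_t div9_shift_add(uint16_t x)
{
    return (uint16_t)((x - (x >> 3) + (x >> 6)) >> 3);
}

/* Multiply-and-shift: 7282 = ceil(65536 / 9).
 * Exact (equal to x / 9) for all x < 32768.
 * (Hypothetical helper, for illustration only.) */
static uint16_t div9_mul(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 7282u) >> 16);
}
```

Both forms use only operations that have per-lane vector equivalents (shifts, adds, a widening multiply and a narrowing shift), so either should carry over to 16-bit lanes; which one is faster on a given core is something to measure rather than assume.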