Explaining ARM Neon Image Sampling

Question

I'm trying to write a better version of OpenCV's cv::resize(), and I came across this code: https://github.com/rmaz/NEON-Image-Downscaling/blob/master/ImageResize/BDPViewController.m The code downsamples an image by 2, but I can't follow the algorithm. I would first like to convert the algorithm to C and then modify it for learning purposes. Is it also easy to convert it to downsample by an arbitrary factor?

The function is:

static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
    const uint32_t * rowB = src + pixelsPerRow;

    // force the number of pixels per row to a multiple of 8
    pixelsPerRow = 8 * (pixelsPerRow / 8);

    __asm__ volatile("Lresizeloop: \n" // start loop
                     "vld1.32 {d0-d3}, [%1]! \n" // load 8 pixels from the top row
                     "vld1.32 {d4-d7}, [%2]! \n" // load 8 pixels from the bottom row
                     "vhadd.u8 q0, q0, q2 \n" // average the pixels vertically
                     "vhadd.u8 q1, q1, q3 \n"
                     "vtrn.32 q0, q2 \n" // transpose to put the horizontally adjacent pixels in different registers
                     "vtrn.32 q1, q3 \n"
                     "vhadd.u8 q0, q0, q2 \n" // average the pixels horizontally
                     "vhadd.u8 q1, q1, q3 \n"
                     "vtrn.32 d0, d1 \n" // fill the registers with pixels
                     "vtrn.32 d2, d3 \n"
                     "vswp d1, d2 \n"
                     "vst1.64 {d0-d1}, [%0]! \n" // store the result
                     "subs %3, %3, #8 \n" // subtract 8 from the pixel count
                     "bne Lresizeloop \n" // repeat until the row is complete
: "=r"(dst), "=r"(src), "=r"(rowB), "=r"(pixelsPerRow)
: "0"(dst), "1"(src), "2"(rowB), "3"(pixelsPerRow)
: "q0", "q1", "q2", "q3", "cc"
);
}

To call it:

 // downscale the image in place
    for (size_t rowIndex = 0; rowIndex < height; rowIndex+=2)
    {
        void *sourceRow = (uint8_t *)buffer + rowIndex * bytesPerRow;
        void *destRow = (uint8_t *)buffer + (rowIndex / 2) * bytesPerRow;
        resizeRow(destRow, sourceRow, width);
    }

You found a pretty bad example: 1) It truncates with each half-add, so the result is less accurate. 2) It wastes valuable cycles on all those transposes, in addition to being confusing. With VPADD and VPADAL instead of the half-adds, the function will be a lot faster (the transposes are gone) and more accurate (it truncates just once). — Jake 'Alquimista' LEE, Jun 23 '13 at 05:35
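The accuracy point in the comment above can be illustrated with a scalar model. This is a minimal sketch in plain C, not NEON code: hadd_u8, avg4_halving and avg4_widened are hypothetical names introduced here. The first path mimics the chained truncating half-adds of the assembly; the second sums the four samples in a wider type and rounds once, which is the arithmetic a widening pairwise-add approach would perform.

```c
#include <stdint.h>

/* Truncating halving add, modelling one 8-bit lane of VHADD.U8. */
static uint8_t hadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* 2x2 average the way the assembly does it: chained truncating half-adds. */
static uint8_t avg4_halving(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br)
{
    uint8_t left  = hadd_u8(tl, bl);   /* vertical average, left column  */
    uint8_t right = hadd_u8(tr, br);   /* vertical average, right column */
    return hadd_u8(left, right);       /* horizontal average             */
}

/* 2x2 average with a widened sum and a single rounding step at the end. */
static uint8_t avg4_widened(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br)
{
    unsigned sum = (unsigned)tl + tr + bl + br;   /* at most 1020, no overflow */
    return (uint8_t)((sum + 2) >> 2);
}
```

For the samples 1, 0, 0, 1 the exact average is 0.5: the chained half-adds return 0, while the widened sum rounds to 1.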

BitBank · Accepted Answer · 2013-03-13T19:06:52.553

The algorithm is pretty straightforward. It reads 8 pixels from the current line and 8 from the line below. It then uses the vhadd (halving-add) instruction to average the 8 pixels vertically. It then transposes the position of the pixels so that the horizontally adjacent pixel pairs are now in separate registers (arranged vertically). It then does another set of halving-adds to average those together. The result is then transformed again to put them in their original positions and written to the destination. This algorithm could be rewritten to handle different integral sizes of scaling, but as written it can only do 2x2 to 1 reduction with averaging. Here's the C code equivalent:

static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
    uint8_t * pSrc8 = (uint8_t *)src;
    uint8_t * pDest8 = (uint8_t *)dst;
    int stride = pixelsPerRow * sizeof(uint32_t); // byte offset to the row below
    uint32_t x;
    int r, g, b, a;

    // each iteration consumes 2 source pixels, so a row yields pixelsPerRow/2 outputs
    for (x=0; x<pixelsPerRow/2; x++)
    {
       r = pSrc8[0] + pSrc8[4] + pSrc8[stride+0] + pSrc8[stride+4];
       g = pSrc8[1] + pSrc8[5] + pSrc8[stride+1] + pSrc8[stride+5];
       b = pSrc8[2] + pSrc8[6] + pSrc8[stride+2] + pSrc8[stride+6];
       a = pSrc8[3] + pSrc8[7] + pSrc8[stride+3] + pSrc8[stride+7];
       pDest8[0] = (uint8_t)((r + 2)/4); // average with rounding (the NEON code truncates instead)
       pDest8[1] = (uint8_t)((g + 2)/4);
       pDest8[2] = (uint8_t)((b + 2)/4);
       pDest8[3] = (uint8_t)((a + 2)/4);
       pSrc8 += 8; // skip forward 2 source pixels
       pDest8 += 4; // skip forward 1 destination pixel
    }
}
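On the question of other integral scale factors, the same box-filter idea extends naturally in scalar C. The following is only a sketch under stated assumptions, not part of the original answer: boxDownscale is a hypothetical name, the image is assumed to be tightly packed at 4 bytes per pixel with no row padding, source and destination are assumed to be separate buffers, and leftover columns or rows that do not fill a complete block are dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: average each factor x factor block of a 4-byte-per-pixel
   image into one output pixel. Output size is (width/factor) x (height/factor). */
static void boxDownscale(uint8_t *dst, const uint8_t *src,
                         size_t width, size_t height, size_t factor)
{
    assert(factor > 0);
    size_t outW = width / factor;
    size_t outH = height / factor;
    uint32_t area = (uint32_t)(factor * factor);

    for (size_t oy = 0; oy < outH; oy++) {
        for (size_t ox = 0; ox < outW; ox++) {
            uint32_t sum[4] = {0, 0, 0, 0};      /* one accumulator per channel */
            for (size_t dy = 0; dy < factor; dy++) {
                /* first pixel of this block on source row oy*factor + dy */
                const uint8_t *p = src + ((oy * factor + dy) * width + ox * factor) * 4;
                for (size_t dx = 0; dx < factor; dx++, p += 4)
                    for (int c = 0; c < 4; c++)
                        sum[c] += p[c];
            }
            uint8_t *q = dst + (oy * outW + ox) * 4;
            for (int c = 0; c < 4; c++)
                q[c] = (uint8_t)((sum[c] + area / 2) / area);   /* round once */
        }
    }
}
```

With factor set to 2 this computes the same rounded 2x2 average as the C loop above. Non-integer factors need a different approach, such as interpolation with weighted samples, which this sketch does not cover.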

Amazing answer. How are the pixels averaged vertically? Two pixels, then two pixels? And what is needed to support different scaling sizes? I tried it on paper, but I still don't get the transpose operation and the horizontally adjacent pixels :/ — andre_lamothe, Mar 13 '13 at 18:06
The vhadd instruction is equivalent to c = (a+b)/2 with the result truncated (the rounding variant, vrhadd, computes c = (a+b+1)/2). The first part of the code averages the pixels vertically because the NEON registers containing the top and bottom lines are averaged together. Since there is no vhadd that operates horizontally across NEON vector elements, the values need to be transposed so that horizontally adjacent elements are placed in separate registers. After transposition, they are averaged again (averaging the pixels "horizontally"). See a description of VTRN here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHDJAEA.html — BitBank, Mar 13 '13 at 18:12
It's correct as written. I write 4 bytes, then advance the destination pointer by 4 bytes. — BitBank, Mar 13 '13 at 20:37
Huh? 2D array? No, it's a simple pointer to a one dimensional row of pixels. Not sure where your confusion comes from. — BitBank, Mar 13 '13 at 21:03
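For anyone else stuck on the transpose step discussed in these comments, a scalar model of half the loop body may help. This is a sketch only: vtrn32, vhadd_u8x4 and resize4 are hypothetical names, each array stands in for one q register holding four 32-bit pixels, and the lane order is assumed to follow memory order. V denotes a vertically averaged pixel and B a leftover bottom-row pixel.

```c
#include <stdint.h>

/* Model of VTRN.32 on two 4-lane registers: each pair of lanes from a and b is
   treated as a 2x2 matrix and transposed, so a[1]<->b[0] and a[3]<->b[2]. */
static void vtrn32(uint32_t a[4], uint32_t b[4])
{
    for (int i = 0; i < 4; i += 2) {
        uint32_t t = a[i + 1];
        a[i + 1] = b[i];
        b[i] = t;
    }
}

/* Model of VHADD.U8 on one 32-bit pixel: truncating average of each byte. */
static uint32_t vhadd_u8x4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int s = 0; s < 32; s += 8) {
        uint32_t x = (a >> s) & 0xFF, y = (b >> s) & 0xFF;
        r |= ((x + y) >> 1) << s;
    }
    return r;
}

/* Half of the loop body: 4 top and 4 bottom pixels in, 2 averaged pixels out. */
static void resize4(uint32_t out[2], const uint32_t top[4], const uint32_t bottom[4])
{
    uint32_t q0[4], q2[4];
    for (int i = 0; i < 4; i++) {
        q2[i] = bottom[i];
        q0[i] = vhadd_u8x4(top[i], bottom[i]);   /* vertical: q0 = {V0,V1,V2,V3} */
    }
    vtrn32(q0, q2);            /* q0 = {V0,B0,V2,B2}, q2 = {V1,B1,V3,B3} */
    for (int i = 0; i < 4; i++)
        q0[i] = vhadd_u8x4(q0[i], q2[i]);        /* lane 0 = avg(V0,V1), lane 2 = avg(V2,V3) */
    out[0] = q0[0];            /* the assembly gathers lanes 0 and 2 */
    out[1] = q0[2];            /* with its final vtrn.32 d0, d1      */
}
```

After the transpose, lane 0 of one register holds V0 and lane 0 of the other holds V1, so an ordinary lane-by-lane half-add averages horizontal neighbours. Lanes 1 and 3 contain by-products that the last transposes and the swap move out of the stored registers.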
