Bilinear Interpolation from C to Neon

Question

I'm trying to downsample an image using NEON. As practice, I first wrote a function that subtracts two images using NEON intrinsics, and that worked. Now I have come back to writing the bilinear interpolation with NEON intrinsics. Right now I have two problems: loading 4 pixels (two from one row and two from the row below), and computing the interpolated gray value from those 4 pixels, or from 8 pixels if that is possible. I have thought about it, but I suspect the algorithm has to be rewritten completely?

#include <arm_neon.h>
#include <stdint.h>

void resizeBilinearNeon(uint8_t *src, uint8_t *dest, int srcWidth, int srcHeight, int destWidth, int destHeight)
{
    int x, y, index, gray;

    float x_ratio = ((float)(srcWidth - 1)) / destWidth;
    float y_ratio = ((float)(srcHeight - 1)) / destHeight;
    float x_diff, y_diff;

    for (int i = 0; i < destHeight; i++) {
        for (int j = 0; j < destWidth; j++) {
            x = (int)(x_ratio * j);
            y = (int)(y_ratio * i);
            x_diff = (x_ratio * j) - x;
            y_diff = (y_ratio * i) - y;
            index = y * srcWidth + x;

            // vld1_u8 takes a pointer and always reads 8 bytes,
            // so near the end of the image it can read past the buffer
            uint8x8_t pixels_r = vld1_u8(&src[index]);
            uint8x8_t pixels_c = vld1_u8(&src[index + srcWidth]);

            // Y = A(1-w)(1-h) + B(w)(1-h) + C(h)(1-w) + Dwh
            gray = (int)(
                       vget_lane_u8(pixels_r, 0) * (1 - x_diff) * (1 - y_diff) + vget_lane_u8(pixels_r, 1) * (x_diff) * (1 - y_diff) +
                       vget_lane_u8(pixels_c, 0) * (y_diff) * (1 - x_diff)     + vget_lane_u8(pixels_c, 1) * (x_diff * y_diff)
                   );

            dest[i * destWidth + j] = (uint8_t)gray;
        }
    }
}

A proper downsampling routine must take into account much more than 4 pixels for each output pixel. Bilinear is the wrong approach; it's no better than nearest neighbor in this application. – Mark Ransom, Mar 19 '13 at 14:28
@MarkRansom I have tried plain nearest neighbour, but the image quality hurt my computer vision routines. Bilinear interpolation fits my application best, but the problem is that the OpenCV function is slow. – andre_lamothe, Mar 19 '13 at 14:31
Bilinear works rather well in the range of ±50% scaling. Beyond that range one gets pixelation at one end and frequency aliasing (e.g. the Moiré effect) at the other. The bottleneck in parallelization / vectorization is the need to access at least 4 "random" memory elements per pixel and to generate their effective addresses; the solution is to use pshufb on Intel and vtbl on NEON to access 8, 16, or even 32 (ymm) individual bytes from a very local look-up table. – Aki Suihkonen, Mar 19 '13 at 20:39

Aki Suihkonen · Answer 1 · 2013-03-19T19:10:04.553

NEON will definitely help with downsampling by an arbitrary ratio using bilinear filtering. The key is clever use of the vtbl.8 instruction, which performs a parallel table look-up for 8 consecutive destination pixels from a pre-loaded array:

 d0 = a [b] c [d] e [f]  g  h, d1 =  i  j  k  l  m  n  o  p 
 d2 = q  r  s  t  u  v  [w] x, d3 = [y] z [A] B [C][D] E  F ...
 d4 = G  H  I  J  K  L   M  N, d5 =  O  P  Q  R  S  T  U  V ...

One can easily calculate the fractional positions for the pixels in brackets:

 [b] [d] [f] [w] [y] [A] [C] [D],  accessed with vtbl.8 d6, {d0,d1,d2,d3}
 The row below would be accessed with            vtbl.8 d7, {d2,d3,d4,d5}

Incrementing the indices with vadd.i8 d6, d30 (where d30 = [1 1 1 1 1 ... 1]) gives the lookup indices for the pixels to the right of the origin, and so on.

There's no reason to fetch the pixels from two rows other than to illustrate that it's possible, and that the method can also be used to implement slight distortions if needed.
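The lookup described above can be sketched in C as follows. This is only an illustration: the names gather8_from32 and neighbour_indices are hypothetical, the 32-byte table stands for the four registers d0..d3, and the scalar branch is a stand-in that mimics the vtbl rule of returning 0 for an out-of-range index so the sketch also builds on non-ARM machines.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Gather 8 bytes from a 32-byte table (four D registers' worth of pixels).
 * Follows vtbl semantics: an index >= 32 yields 0. */
static void gather8_from32(const uint8_t *table, const uint8_t *idx,
                           uint8_t *out)
{
    assert(table && idx && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x4_t t;
    t.val[0] = vld1_u8(table);
    t.val[1] = vld1_u8(table + 8);
    t.val[2] = vld1_u8(table + 16);
    t.val[3] = vld1_u8(table + 24);
    vst1_u8(out, vtbl4_u8(t, vld1_u8(idx)));  /* one lookup, 8 lanes */
#else
    for (int k = 0; k < 8; k++)               /* scalar stand-in */
        out[k] = (idx[k] < 32) ? table[idx[k]] : 0;
#endif
}

/* Indices of the right-hand neighbours: add 1 in every lane. */
static void neighbour_indices(const uint8_t *idx, uint8_t *out)
{
    assert(idx && out);
    for (int k = 0; k < 8; k++)
        out[k] = (uint8_t)(idx[k] + 1);
}
```

With the table "abcdefghijklmnopqrstuvwxyzABCDEF" and the indices {1, 3, 5, 22, 24, 26, 28, 29}, the gather picks the bracketed pixels b d f w y A C D from the diagram, and the incremented indices pick their right-hand neighbours.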

In real-time applications, using e.g. Lanczos can be a bit of overkill, but it is still feasible with NEON. Downsampling by larger factors of course requires (heavy) filtering, but this can easily be achieved by iteratively averaging and decimating by 2:1 and using fractional sampling only at the end.
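A minimal scalar sketch of the "halve repeatedly, then resample fractionally" strategy, assuming an 8-bit grayscale image stored without row padding and even dimensions (odd sizes need an edge policy that is not shown here). The function names halve_2x2 and halving_passes are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Halve an 8-bit grayscale image with a rounded 2x2 box average.
 * dst must hold (w / 2) * (h / 2) bytes. */
static void halve_2x2(const uint8_t *src, int w, int h, uint8_t *dst)
{
    assert(src && dst);
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    for (int y = 0; y < h / 2; y++) {
        const uint8_t *r0 = src + (size_t)(2 * y) * (size_t)w;
        const uint8_t *r1 = r0 + w;
        for (int x = 0; x < w / 2; x++) {
            int sum = r0[2 * x] + r0[2 * x + 1] + r1[2 * x] + r1[2 * x + 1];
            dst[(size_t)y * (size_t)(w / 2) + (size_t)x] =
                (uint8_t)((sum + 2) >> 2);
        }
    }
}

/* Number of 2:1 passes to run first, so that the ratio left for the
 * final fractional (bilinear) step is at most 2. */
static int halving_passes(int src_w, int dst_w)
{
    assert(src_w > 0 && dst_w > 0);
    int passes = 0;
    while (src_w > 2 * dst_w) {
        src_w /= 2;
        passes++;
    }
    return passes;
}
```

For example, going from a width of 1280 to 400 needs one halving pass (to 640), which leaves a ratio of 1.6 for the fractional step.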

For any 8 consecutive pixels to write, one can calculate the vector

  x_positions = (X + [0 1 2 3 4 5 6 7]) * source_width / target_width;
  y_positions = (Y + [0 0 0 0 0 0 0 0]) * source_height / target_height;

  ptr = to_int(x_positions) + y_positions * stride;
  x_positions += (ptr & 7); // this pointer arithmetic goes only for 8-bit planar
  ptr &= ~7;               // this is to adjust read pointer to qword alignment

  vld1.8 {d0,d1}, [r0], r2 // top row; post-increment r0 by r2 (r2 == stride)
  vld1.8 {d2,d3}, [r0]     // bottom row

  d4 = int_part_of (x_positions);
  d5 = d4 + 1;
  d6 = fract_part_of (x_positions);
  d7 = fract_part_of (y_positions);

  vtbl.8 d8, {d0,d1}, d4    // read top row
  vtbl.8 d9, {d0,d1}, d5    // read top row +1
  MIX(d8,d9,d6)             // horizontal mix of ptr[0] & ptr[1]
  vtbl.8 d10, {d2,d3}, d4   // read bottom row
  vtbl.8 d11, {d2,d3}, d5   // read bottom row +1
  MIX(d10,d11,d6)           // horizontal mix of ptr[stride] & ptr[stride+1]
  MIX(d8,d10,d7)            // vertical mix of the two rows

  // MIX (dst, src, fract) is a macro that somehow does linear blending
  // should be doable with ~3-4 instructions
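One possible shape for the MIX step, as a hedged sketch rather than the author's actual macro: it assumes the fraction is an 8-bit value f in 0..255 standing for f/256, and computes (a*(256-f) + b*f + 128) >> 8 for 8 pixels. The name mix8 is hypothetical; the scalar branch exists only so the sketch also builds without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Blend 8 pixel pairs: out = (a*(256-f) + b*f + 128) >> 8, f in 0..255. */
static void mix8(const uint8_t *a, const uint8_t *b, const uint8_t *f,
                 uint8_t *out)
{
    assert(a && b && f && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t va = vld1_u8(a), vb = vld1_u8(b), vf = vld1_u8(f);
    uint16x8_t acc = vmull_u8(va, vmvn_u8(vf)); /* a * (255 - f)          */
    acc = vaddw_u8(acc, va);                    /* + a  ->  a * (256 - f) */
    acc = vmlal_u8(acc, vb, vf);                /* + b * f                */
    vst1_u8(out, vrshrn_n_u16(acc, 8));         /* rounded >> 8, narrowed */
#else
    for (int k = 0; k < 8; k++)                 /* scalar stand-in */
        out[k] = (uint8_t)((a[k] * (256 - f[k]) + b[k] * f[k] + 128) >> 8);
#endif
}
```

The 16-bit accumulator cannot overflow, because a*(256-f) + b*f is at most 255*256. A full bilinear sample would call this three times: once per row with the horizontal fractions, then once on the two row results with the vertical fraction.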

To calculate the integer parts, 8.8-bit fixed-point resolution is enough (one really doesn't have to calculate 666+[0 1 2 3 .. 7]), and all intermediate results can be kept in SIMD registers.
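To make the 8.8 fixed-point idea concrete, here is a small scalar sketch. It assumes positions are measured relative to the start of the loaded table (so the large absolute offset never enters the arithmetic) and that the table spans at most 32 bytes, the vtbl maximum. The names step_8_8 and positions_8_8 are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Horizontal step between destination pixels in 8.8 fixed point. */
static uint16_t step_8_8(int src_w, int dst_w)
{
    assert(src_w > 0 && dst_w > 0 && src_w < 256 * dst_w);
    return (uint16_t)(((uint32_t)src_w << 8) / (uint32_t)dst_w);
}

/* Split 8 consecutive 8.8 positions into table indices and fractions.
 * start is the position of the first pixel relative to the table base. */
static void positions_8_8(uint16_t start, uint16_t step,
                          uint8_t *idx, uint8_t *frac)
{
    assert(idx && frac);
    /* idx + 1 (the right neighbour) must still be inside a 32-byte table */
    assert((((uint32_t)start + 7u * step) >> 8) + 1u < 32u);
    for (int k = 0; k < 8; k++) {
        uint16_t pos = (uint16_t)(start + k * step);
        idx[k]  = (uint8_t)(pos >> 8);    /* integer part: lookup index  */
        frac[k] = (uint8_t)(pos & 0xFF);  /* fractional part: mix weight */
    }
}
```

With start = 0x0040 (0.25) and step = 0x0180 (1.5), the indices come out as 0, 1, 3, 4, 6, 7, 9, 10 and the fractions alternate between 0x40 and 0xC0.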

Disclaimer: this is conceptual pseudo-C / vector code. In SIMD there are two parallel tasks to optimize: minimizing the number of arithmetic operations, and minimizing unnecessary shuffling / copying of data. In this respect, too, NEON with its three-register approach is much better suited to serious DSP than SSE. The second advantage is the number of multiplication instructions, and the third is the interleaving instructions.

I've never figured out how to properly decimate by 2:1 when the input dimension is odd. Especially when you repeat it, the errors at the edge will accumulate. – Mark Ransom, Mar 19 '13 at 16:27
[bragging mode] My memory may not serve me correctly, but on a 600MHz Cortex-A8 a NEON implementation of mine was able to run 30 FPS bilinear interpolation on an irregular grid for stereographic (3D) video alignment: 3D-axis calibration, scaling, barrel correction, histogram equalization and more, with all parameters variable at runtime. Keywords: neon and bilinear. [/bragging mode] – Aki Suihkonen, Mar 19 '13 at 20:24
@Aki Suihkonen: If you provide a pointer to the source code for the combined pipeline, I'll upvote your bragging comment ;-) – FrankH., Mar 20 '13 at 10:27
I'm afraid delivering actual code is out of the question for legal reasons alone. But I can reconstruct selected parts or answer specific questions. – Aki Suihkonen, Mar 20 '13 at 10:34

score 1 · Accepted Answer · answered Mar 19 '13 at 14:46

@MarkRansom is not correct about nearest neighbor versus 2x2 bilinear interpolation; bilinear using 4 pixels will produce better output than nearest neighbor. He is correct that averaging the appropriate number of pixels (more than 4 if scaling by > 2:1) will produce better output still. However, NEON will not help with image downsampling unless the scaling is done by an integer ratio.

The maximum benefit of NEON and other SIMD instruction sets is to be able to process 8 or 16 pixels at once using the same operations. By accessing individual elements the way you are, you lose all the SIMD benefit. Another problem is that moving data from NEON to ARM registers is a slow operation. Downsampling images is best done by a GPU or optimized ARM instructions.
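For the fixed 2:1 case, where every lane does the same work, a kernel could look roughly like the sketch below. It is an illustration under assumptions, not a tuned routine: the name halve_rows_16to8 is hypothetical, it handles one block of 16 pixels from each of two adjacent rows, and the scalar branch is only a fallback for builds without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Produce 8 output pixels from 16 pixels of each of two adjacent rows,
 * using a rounded 2x2 average: (p00 + p01 + p10 + p11 + 2) >> 2. */
static void halve_rows_16to8(const uint8_t *row0, const uint8_t *row1,
                             uint8_t *out)
{
    assert(row0 && row1 && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t sum = vpaddlq_u8(vld1q_u8(row0)); /* pair sums of row 0     */
    sum = vpadalq_u8(sum, vld1q_u8(row1));       /* + pair sums of row 1   */
    vst1_u8(out, vrshrn_n_u16(sum, 2));          /* rounded >> 2, narrowed */
#else
    for (int k = 0; k < 8; k++) {                /* scalar fallback */
        int sum = row0[2 * k] + row0[2 * k + 1] + row1[2 * k] + row1[2 * k + 1];
        out[k] = (uint8_t)((sum + 2) >> 2);
    }
#endif
}
```

All data stays in NEON registers between the loads and the store, so there is no transfer to ARM core registers inside the kernel.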

BitBank


I have tried the GPU approach, and reading data back from the GPU is slower than OpenCV! This is my problem now: I don't know how to work with 8 pixels at a time to do the equivalent of the C code. I hope you can help with that, so that I can at least see it with my own eyes :). – andre_lamothe Mar 19 '13 at 14:55
It's not a problem of seeing it through your eyes, it's a problem that's hard to solve in a vectorized way. If you can limit the problem to resampling the image by a fixed amount (e.g. 2:1), then you can write an optimized NEON solution. – BitBank Mar 19 '13 at 15:12
That's why I wanted to see it with my own eyes. I thought a lot about it on paper using 8-byte vectors, and it's very hard to loop through the columns and do the processing. How about a function that goes from 1280*960 to 400*300, and then another one for each other resolution? – andre_lamothe Mar 19 '13 at 15:31
A faster way would probably be 1280x960 -> 640x480 with NEON, then 640x480->400x300 with an optimized C/ASM routine. – BitBank Mar 19 '13 at 16:10
My comment was based on downsampling ratios much greater than 2:1, creating thumbnails for example; since the question didn't mention ratios I stand by my original statement. In the 1:1 to 2:1 range you might get acceptable results due to the smoothing effect of bilinear which is equivalent to a tent filter. – Mark Ransom Mar 19 '13 at 16:24
1280x960 -> 640x480 would best be done with a simple 4x4 average, no need to do bilinear at all. – Mark Ransom Mar 19 '13 at 16:25
@BitBank I will downsample by 2 using NEON and then use the OpenCV resize function after that. I hope that speeds up the process. – andre_lamothe Mar 19 '13 at 16:30
It may be more accurate to say that using NEON to help with downsampling by non-integer ratios is difficult, not that NEON will not help. – Eric Postpischil Mar 19 '13 at 16:37
