ARM NEON simple low pass filter vectorization

Question

I have a simple single pole low pass filter (for parameter smoothing) that can be explained by the following formula:

y[n] = (1-a) * y[n-1] + a * x[n]

How can I efficiently vectorize this on ARM NEON using intrinsics? Is it even possible? The problem is that every computation needs the previous result.

My answer below talks specifically about how to restructure the problem to allow for parallel computation, but for any more specific answer containing specifics of the NEON implementation, one would need to know what number format you're using, etc. — Jason R, Jan 06 '12 at 16:44

score 3 · Answer 1 · answered Jan 05 '12 at 23:38

Assuming that you perform vector operations M elements at a time (NEON's quadword registers are 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next four as follows:

y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
       = (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]

I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.
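A minimal sketch of the unrolled form for M = 4, assuming 32-bit floats, separate input and output buffers, and a length that is a multiple of 4. The names `lp_scalar` and `lp_block4` are made up for illustration. The NEON path is only compiled when the compiler defines `__ARM_NEON`; otherwise a plain C loop computes the same sums.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: y[n] = (1-a)*y[n-1] + a*x[n]. Returns the last output. */
float lp_scalar(const float *x, float *y, size_t n, float a, float y_prev)
{
    for (size_t i = 0; i < n; i++) {
        y_prev = (1.0f - a) * y_prev + a * x[i];
        y[i] = y_prev;
    }
    return y_prev;
}

/* Unrolled by 4: each block of four outputs depends only on the last output
   of the previous block and on four inputs. n must be a multiple of 4. */
float lp_block4(const float *x, float *y, size_t n, float a, float y_prev)
{
    assert(n % 4 == 0);
    const float b = 1.0f - a;
    const float b2 = b * b, b3 = b2 * b, b4 = b3 * b;
    /* pw[k-1] = (1-a)^k weights y_prev; cJ[k-1] weights the J-th input. */
    const float pw[4] = { b,    b2,    b3,     b4     };
    const float c1[4] = { a,    a * b, a * b2, a * b3 };
    const float c2[4] = { 0.0f, a,     a * b,  a * b2 };
    const float c3[4] = { 0.0f, 0.0f,  a,      a * b  };
    const float c4[4] = { 0.0f, 0.0f,  0.0f,   a      };
#if defined(__ARM_NEON)
    const float32x4_t vpw = vld1q_f32(pw);
    const float32x4_t v1 = vld1q_f32(c1), v2 = vld1q_f32(c2);
    const float32x4_t v3 = vld1q_f32(c3), v4 = vld1q_f32(c4);
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t acc = vmulq_n_f32(vpw, y_prev);  /* (1-a)^k * y[n]   */
        acc = vmlaq_n_f32(acc, v1, x[i]);            /* + column * input */
        acc = vmlaq_n_f32(acc, v2, x[i + 1]);
        acc = vmlaq_n_f32(acc, v3, x[i + 2]);
        acc = vmlaq_n_f32(acc, v4, x[i + 3]);
        vst1q_f32(&y[i], acc);
        y_prev = vgetq_lane_f32(acc, 3);             /* seed for next block */
    }
#else
    for (size_t i = 0; i < n; i += 4) {
        for (int k = 0; k < 4; k++)
            y[i + k] = pw[k] * y_prev + c1[k] * x[i] + c2[k] * x[i + 1]
                     + c3[k] * x[i + 2] + c4[k] * x[i + 3];
        y_prev = y[i + 3];
    }
#endif
    return y_prev;
}
```

Each block of four outputs costs one vector multiply and four vector multiply-accumulates. The results can differ from the scalar loop in the last bits because the operations are grouped differently.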

There are some caveats to this approach: if M becomes large, then you end up multiplying a bunch of numbers together in order to get the effective FIR coefficients for the unrolled filters. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations instead of the simple first-order recursive implementation diminishes some of the benefit to vectorization.

Is the vectorized version with 9 operations more efficient than using the original scalar version, which has just three operations? OK, but overall the scalar one would need 4*3=12 ops, so it is probably slightly slower than the vector version, right? — André Bergner, Jan 06 '12 at 21:19
Yes, that's what I was getting at in my last paragraph; there isn't as big of a benefit as you would like in terms of operation count, only up to 50% instead of `1/M`. [A very similar version of this question is cross-posted at Signal Processing](http://dsp.stackexchange.com/questions/1075/how-can-i-vectorize-the-computations-for-a-first-order-recursive-filter), focusing more on the problem structure and operation count than anything NEON-specific. There are some additional details there. — Jason R, Jan 06 '12 at 21:23

score 0 · Answer 2 · answered Jan 05 '12 at 21:16

You can only really vectorize this if you have more than one signal to which you wish to apply the same filter, e.g. if it's a stereo audio signal then you can process the left and right channel in parallel. Four or eight channels in parallel would obviously be even better.
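One way to sketch this, assuming four channels of 32-bit floats stored interleaved (ch0, ch1, ch2, ch3, ch0, ...) and the same coefficient a for every channel. `lp_4ch_interleaved` is a hypothetical name; the NEON path is only compiled when `__ARM_NEON` is defined, with a plain C fallback otherwise.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* x and y hold `frames` interleaved frames of 4 floats each. state[c] is the
   previous output of channel c and is updated before returning. */
void lp_4ch_interleaved(const float *x, float *y, size_t frames,
                        float a, float state[4])
{
    const float b = 1.0f - a;
#if defined(__ARM_NEON)
    float32x4_t vy = vld1q_f32(state);                 /* one channel per lane */
    for (size_t f = 0; f < frames; f++) {
        const float32x4_t vx = vld1q_f32(&x[4 * f]);
        vy = vmlaq_n_f32(vmulq_n_f32(vy, b), vx, a);   /* (1-a)*y + a*x */
        vst1q_f32(&y[4 * f], vy);
    }
    vst1q_f32(state, vy);
#else
    for (size_t f = 0; f < frames; f++) {
        for (int c = 0; c < 4; c++) {
            state[c] = b * state[c] + a * x[4 * f + c];
            y[4 * f + c] = state[c];
        }
    }
#endif
}
```

Here the recursion stays serial in time, but each step advances all four channels at once, so there is no extra arithmetic compared with the scalar filter.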

score 0 · Answer 3 · answered Jan 05 '12 at 22:37

In general, you can only vectorize completely independent sets of computations. But in your IIR low pass, every output is dependent on another (except the 1st), so vectorization is not possible.

If your variable "a" is large enough that (1-a)^n quickly decays to below your desired noise floor or allowed error, you could substitute a short FIR filter approximation for your IIR, and vectorize that convolution instead. But that's not likely to be faster.
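A rough sketch of that substitution, assuming 32-bit floats and inputs before the start of the buffer treated as zero. The impulse response of the filter is h[i] = a*(1-a)^i, so the taps are cut off once (1-a)^i drops below a tolerance `eps`. Both function names and `eps` are illustrative, not from any library.

```c
#include <stddef.h>

/* Fill h with taps a*(1-a)^i until (1-a)^i < eps or max_taps is reached.
   Returns the number of taps written. */
size_t lp_fir_taps(float a, float eps, float *h, size_t max_taps)
{
    const float b = 1.0f - a;
    float p = 1.0f;                 /* p = (1-a)^i */
    size_t n = 0;
    while (n < max_taps && p >= eps) {
        h[n++] = a * p;
        p *= b;
    }
    return n;
}

/* Direct-form FIR: y[i] = sum_j h[j] * x[i-j], with x[<0] taken as 0.
   The outputs are independent of each other, so this loop can be vectorized. */
void fir_apply(const float *x, float *y, size_t n, const float *h, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < taps && j <= i; j++)
            acc += h[j] * x[i - j];
        y[i] = acc;
    }
}
```

The dropped tail of the impulse response sums to (1-a)^N for N taps, so the error is bounded by roughly `eps` times the largest input magnitude. For small a the tap count grows quickly, which is why this is unlikely to beat the recursion.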

score 0 · Answer 4 · answered Jan 05 '12 at 23:16


How about expanding the equations to 4 steps and using matrix multiplication? a is constant, so the matrix can be precalculated.
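A possible reading of this suggestion, sketched in plain C under the assumption of 32-bit floats: four steps of the recursion are collected into a 4x5 matrix A so that [y[n+1] .. y[n+4]]^T = A * [y[n], x[n+1] .. x[n+4]]^T. The names `lp_build_matrix` and `lp_apply_matrix` are hypothetical.

```c
#include <stddef.h>

enum { LP_M = 4, LP_COLS = LP_M + 1 };

/* Fill the row-major LP_M x LP_COLS matrix A. Each row is (1-a) times the
   previous row, plus the newest input entering with weight a. */
void lp_build_matrix(float a, float *A)
{
    const float b = 1.0f - a;
    for (int j = 0; j < LP_COLS; j++)
        A[j] = 0.0f;
    A[0] = b;                       /* row 0: y[n+1] = b*y[n] + a*x[n+1] */
    A[1] = a;
    for (int k = 1; k < LP_M; k++) {
        for (int j = 0; j < LP_COLS; j++)
            A[k * LP_COLS + j] = b * A[(k - 1) * LP_COLS + j];
        A[k * LP_COLS + k + 1] = a;
    }
}

/* Filter n samples in blocks of LP_M (only full blocks are processed).
   Returns the last output, which seeds the next call. */
float lp_apply_matrix(const float *A, const float *x, float *y,
                      size_t n, float y_prev)
{
    for (size_t i = 0; i + LP_M <= n; i += LP_M) {
        for (int k = 0; k < LP_M; k++) {
            float acc = A[k * LP_COLS] * y_prev;
            for (int j = 0; j < LP_M; j++)
                acc += A[k * LP_COLS + j + 1] * x[i + j];
            y[i + k] = acc;
        }
        y_prev = y[i + LP_M - 1];
    }
    return y_prev;
}
```

The matrix is built once per value of a. Its lower-triangular input part is why the work per block is larger than four scalar filter steps.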


Tobby
