Assuming that you perform vector operations M elements at a time (NEON's quad registers are 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next M (here, four) as follows:
y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
= (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...
In general, you can write y[n+k] as:
y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]
I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.
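To make the block structure concrete, here is a minimal sketch in plain Python (not NEON code) that computes each block of M outputs from only the previous block's last output and the current M inputs, and can be checked against the ordinary recursive filter. The function names, the `y0` initial-condition parameter, and the handling of a short final block are my own assumptions for illustration.

```python
def recursive_filter(x, a, y0=0.0):
    """Reference: y[n] = (1-a)*y[n-1] + a*x[n], one sample at a time."""
    y, prev = [], y0
    for sample in x:
        prev = (1.0 - a) * prev + a * sample
        y.append(prev)
    return y


def unrolled_filter(x, a, M=4, y0=0.0):
    """Block version: each block of up to M outputs depends only on the
    last output of the previous block and the inputs of the current block."""
    b = 1.0 - a
    # Precomputed once: decay[k-1] = (1-a)^k and taps[j] = a*(1-a)^j.
    decay = [b ** k for k in range(1, M + 1)]
    taps = [a * b ** j for j in range(M)]
    y, prev = [], y0
    for start in range(0, len(x), M):
        block = x[start:start + M]  # the final block may be shorter than M
        # Every entry of `out` is independent of the others, so this is the
        # part that would map onto SIMD lanes.
        out = [
            decay[k - 1] * prev
            + sum(taps[k - i] * block[i - 1] for i in range(1, k + 1))
            for k in range(1, len(block) + 1)
        ]
        y.extend(out)
        prev = out[-1]  # initial condition for the next block
    return y
```

Both functions should agree to within floating-point rounding for the same `x`, `a`, and `y0`. In a real NEON version the inner sum would presumably become a handful of vector multiply-accumulates over shifted copies of the input block, but that mapping is not shown here.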
There are some caveats to this approach: if M becomes large, the effective FIR coefficients for the unrolled filters involve powers of (1-a) up to (1-a)^M, so you end up multiplying a bunch of numbers together to get them. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations for output k, instead of the single update of the first-order recursive implementation, diminishes some of the benefit of vectorization.
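As a rough illustration of that last point, the sketch below counts scalar multiplies per block of M outputs for the two forms. It assumes the coefficients are precomputed and ignores additions, loads, and the fact that SIMD lanes execute concurrently, so it measures total arithmetic work rather than elapsed time. The function names are hypothetical.

```python
def multiplies_recursive(M):
    # Each output costs two multiplies: (1-a)*y[n-1] and a*x[n].
    return 2 * M


def multiplies_unrolled(M):
    # Output k costs one multiply for the y[n] term plus k for the inputs.
    return sum(k + 1 for k in range(1, M + 1))


for M in (2, 4, 8):
    print(M, multiplies_recursive(M), multiplies_unrolled(M))
```

For M=4 this gives 14 multiplies for the unrolled form against 8 for the recursive one, so the parallel version does more total work and the net gain depends on how well that work spreads across the vector lanes.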