
I'm working on a faster Hermite interpolation routine that streams 16-bit integer samples, using SSE/SSE2 (or even SSE3/SSE4/AVX).

So far it runs well, but I wonder whether it can be optimized further, and whether the 16-bit integer data can be loaded faster.

Thanks for any advice.

Here's the original Hermite Interpolation code.

// Hermite interpolation
public static float InterpolateHermite4pt3oX(float x0, float x1, float x2, float x3, float t)
{
    float c0 = x1;
    float c1 = .5F * (x2 - x0);
    float c2 = x0 - (2.5F * x1) + (2 * x2) - (.5F * x3);
    float c3 = (.5F * (x3 - x0)) + (1.5F * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}

Here's my SSE code so far.

static __m128 S0, S1, S2, S3;
static __m128 dot5 = _mm_set1_ps(0.5f);
static __m128 TwoDot5 = _mm_set1_ps(2.5f);
static __m128 OneDot5 = _mm_set1_ps(1.5f);
static __m128 One = _mm_set1_ps(1.0f);
static __m128 Two = _mm_set1_ps(2.0f);
static __m128 mul16b = _mm_set1_ps(BITS_16_MULT);

#define HIC0 S1
#define HIC1 _mm_mul_ps(dot5, _mm_sub_ps(S2, S0))
#define HIC2 _mm_sub_ps(_mm_add_ps(_mm_sub_ps(S0,  _mm_mul_ps(TwoDot5, S1)), _mm_mul_ps(Two, S2)), _mm_mul_ps(dot5, S3))
#define HIC3 _mm_add_ps(_mm_mul_ps(dot5, _mm_sub_ps(S3, S0)), _mm_mul_ps(OneDot5, _mm_sub_ps(S1, S2)))

#define HICRETURN _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(HIC3, fract), HIC2), fract), HIC1), fract), HIC0)

__m128 fract = _mm_set1_ps(fractPos);
_mm_store_ps(tempWave, HICRETURN);

S0 to S3 are the samples (Sample0 to Sample3). fractPos is the fractional position between one sample and the next.

And for reading the samples I use:

int16* xData = (int16*)sampleData16Bits.getData();
tempWave[0] = float(xData[newPosition]);
tempWave[1] = float(xData[newPosition + 1]);
tempWave[2] = float(xData[newPosition + 2]);
tempWave[3] = float(xData[newPosition + 3]);
S0 = _mm_mul_ps(_mm_load_ps(tempWave), mul16b);
S1 = _mm_shuffle_ps(S0, S0, 0x39);
S2 = _mm_shuffle_ps(S1, S1, 0x39);
S3 = _mm_shuffle_ps(S2, S2, 0x39);
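
One possible way to avoid the four scalar int-to-float conversions and the round trip through tempWave is to load the four 16-bit samples in a single 64-bit load and widen them inside the register. The sketch below is only an illustration under stated assumptions: load4_int16_as_float is a hypothetical helper name, the scale argument stands in for BITS_16_MULT (whose value is not shown in the question), and the four samples starting at src are assumed to be readable. It uses SSE2 only.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not from the original post): loads four consecutive
// int16 samples starting at src and returns them as floats multiplied by scale.
// Assumes src[0..3] are readable; no particular alignment is required.
static inline __m128 load4_int16_as_float(const std::int16_t* src, float scale)
{
    // One 64-bit load brings four int16 values into the low half of the register.
    __m128i packed = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // Interleave the values with themselves so each 32-bit lane holds the
    // sample in its upper 16 bits, then shift right arithmetically by 16 to
    // sign-extend. (On SSE4.1, _mm_cvtepi16_epi32 does this widening directly.)
    __m128i widened = _mm_srai_epi32(_mm_unpacklo_epi16(packed, packed), 16);

    // Convert the four int32 lanes to float and apply the normalization factor.
    return _mm_mul_ps(_mm_cvtepi32_ps(widened), _mm_set1_ps(scale));
}
```

With such a helper, S0 could be obtained as load4_int16_as_float(xData + newPosition, BITS_16_MULT), and the three shuffles that derive S1 to S3 would stay as they are. Whether this is measurably faster depends on the compiler and CPU, so it is worth profiling.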
  • Some things I notice are that your factors for calculating the "c" values are all easy powers of 2. Instead of using floating point input and output, you could do these as integer calculations with shifts and additions. Also, the final "t" sum could potentially be done as a single multiplication and horizontal add, if you already had a vector containing `[t, t^2, t^3, t^4]` – paddy Nov 06 '18 at 21:06
  • I further notice (sorry, has been a long time since I did any intrinsics optimizations) that all you're really doing is what the compiler might have done (or worse). You never seem to build any vectors to compute the various "x" differences in a single operation, nor the multiplications as a single operation. It looks like mainly operating on single values instead of packed vectors. – paddy Nov 06 '18 at 21:09
  • I forgot to add that in stereo samples I add the other channel to another area of the S0...S3 variables, so each _ps calculation would take care of both channels in one go. – William Kalfelz Nov 06 '18 at 21:15
  • I'm changing my code to scalar for mono samples. ;-) – William Kalfelz Nov 06 '18 at 21:41
  • I suggest you expand out your code into single operations per line, then write a comment on each line about exactly what is in the vector and what's being calculated. I did this when rewriting a very complex sinc-function audio resampler using intrinsics. I wrote out my vectors in comments so that I could easily see what was being utilized. In this way I was able to process two sets of interleaved stereo samples at two separate resampling times, in a very unexpected way, which only became apparent when I exploded the algorithm and made all concepts highly visible. – paddy Nov 06 '18 at 22:05
  • I'm thinking of going in another direction: working on 4 samples at once. My sample code could generate the samples for a 4-sample frame, and I would run the Hermite interpolation last on those banks of 4 samples. I will try this out and see how it goes, CPU-wise. – William Kalfelz Nov 07 '18 at 11:54
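
The four-outputs-at-once idea from the last comment could look roughly like the sketch below. It assumes a structure-of-arrays layout that is not in the original code: lane i of x0 to x3 holds the four neighboring input samples for output i, and lane i of t holds that output's fractional position. hermite4_ps is a hypothetical name. The arithmetic is the same as in InterpolateHermite4pt3oX, so every lane does useful work instead of only lane 0.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the original post): evaluates the 4-point,
// 3rd-order Hermite interpolant for four independent outputs at once.
// Lane i of x0..x3 = the four neighboring input samples for output i,
// lane i of t     = the fractional position for output i.
static inline __m128 hermite4_ps(__m128 x0, __m128 x1, __m128 x2, __m128 x3, __m128 t)
{
    const __m128 half   = _mm_set1_ps(0.5f);
    const __m128 onePt5 = _mm_set1_ps(1.5f);
    const __m128 two    = _mm_set1_ps(2.0f);
    const __m128 twoPt5 = _mm_set1_ps(2.5f);

    // c0 = x1
    __m128 c0 = x1;
    // c1 = 0.5 * (x2 - x0)
    __m128 c1 = _mm_mul_ps(half, _mm_sub_ps(x2, x0));
    // c2 = x0 - 2.5 * x1 + 2 * x2 - 0.5 * x3
    __m128 c2 = _mm_sub_ps(
        _mm_add_ps(_mm_sub_ps(x0, _mm_mul_ps(twoPt5, x1)), _mm_mul_ps(two, x2)),
        _mm_mul_ps(half, x3));
    // c3 = 0.5 * (x3 - x0) + 1.5 * (x1 - x2)
    __m128 c3 = _mm_add_ps(_mm_mul_ps(half, _mm_sub_ps(x3, x0)),
                           _mm_mul_ps(onePt5, _mm_sub_ps(x1, x2)));

    // Horner evaluation: ((c3 * t + c2) * t + c1) * t + c0
    __m128 r = _mm_add_ps(_mm_mul_ps(c3, t), c2);
    r = _mm_add_ps(_mm_mul_ps(r, t), c1);
    return _mm_add_ps(_mm_mul_ps(r, t), c0);
}
```

The cost moves to arranging the inputs: gathering x0 to x3 for four different read positions generally needs extra loads or a transpose, so the gain depends on how the surrounding sample code produces its frames.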
