Accelerate framework used, no observable speedup

Question

I have the following piece of audio code that I thought would be a good candidate for vDSP in the Accelerate framework.

// --- get pointers for buffer lists
float* left = (float*)audio->mBuffers[0].mData;
float* right = numChans == 2 ? (float*)audio->mBuffers[1].mData : NULL;

float dLeftAccum = 0.0;
float dRightAccum = 0.0;

float fMix = 0.25; // -12dB HR per note

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate and scale
        dLeftAccum += fMix*(float)dLeft;
        dRightAccum += fMix*(float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}

// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

So I modified it to use vDSP, multiplying by fMix once at the end of the block of frames instead of once per voice.

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate (scaling now happens after the frame loop)
        dLeftAccum += (float)dLeft;
        dRightAccum += (float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}
// --- scale the whole block once; right is NULL for mono, so guard it
vDSP_vsmul(left, 1, &fMix, left, 1, inNumberFrames);
if(right) vDSP_vsmul(right, 1, &fMix, right, 1, inNumberFrames);
// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

However, my CPU usage still remains the same. I see no perceptible benefit of using vDSP here. Am I doing this correctly? Many thanks.

Still new to vector operations, go easy on me :)

If there are some obvious optimizations that I should be doing (outside of the Accelerate framework), feel free to point them out to me, thanks!

Assuming here that m_pVoiceArray[i]->doVoice(dLeft, dRight); is modifying dLeft and dRight (because they are passed by reference and this is C++?), I would have the doVoice function produce a bunch of samples at a time, not just one. You are probably spending most of your time in overhead, looking up data and making function calls. That is, reverse the order of the loops in your frame processing loop. Otherwise, I would like to introduce you to my friend the multiply operator. — Ian Ollmann, Feb 27 '15 at 23:56
It probably didn't speed up much because most of your time per sample was spent in the MAX_VOICES loop. You can verify that by looking at time usage per line of source in Instruments. A couple of multiplies like the ones you pulled out of the loop are trivial compared to the cost of calling a function pointer per (sample * voice). — Ian Ollmann, Mar 09 '15 at 21:04
Yup, I have been doing that since... there are some places where calculations can be pulled out of the main processing loop as you said, or computed less often. — lppier, Mar 10 '15 at 03:00
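
To make the loop-reversal suggestion in the comments concrete, here is a minimal sketch. It is not the asker's code: BlockVoice, renderBlock and mixVoices are hypothetical names, and the voice just outputs a constant so the result is easy to check. The point is only the loop order: one call per voice per buffer, with the per-sample work done in a tight inner loop over contiguous floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the voice class: it renders a whole block
// per call instead of one sample per call.
struct BlockVoice {
    float level;  // constant output, only so the sketch is easy to verify

    void renderBlock(float* outL, float* outR, std::size_t frames) const {
        for (std::size_t n = 0; n < frames; ++n) {
            outL[n] = level;
            outR[n] = -level;
        }
    }
};

// Voices on the outside, frames on the inside.
// scratchL/scratchR are caller-owned so nothing is allocated per callback.
// right may be null for mono output, as in the original code.
void mixVoices(const std::vector<BlockVoice>& voices, float mix,
               float* left, float* right,
               std::vector<float>& scratchL, std::vector<float>& scratchR,
               std::size_t frames)
{
    assert(scratchL.size() >= frames && scratchR.size() >= frames);

    // Clear the output buffers once per block.
    for (std::size_t n = 0; n < frames; ++n) {
        left[n] = 0.0f;
        if (right) right[n] = 0.0f;
    }

    for (const BlockVoice& voice : voices) {
        voice.renderBlock(scratchL.data(), scratchR.data(), frames);

        // Scale and accumulate this voice. On Apple platforms a contiguous
        // multiply-add like this is the kind of loop vDSP_vsma is meant for;
        // a plain loop is used here to keep the sketch portable.
        for (std::size_t n = 0; n < frames; ++n) {
            left[n] += mix * scratchL[n];
            if (right) right[n] += mix * scratchR[n];
        }
    }
}
```

With this shape the virtual or pointer call happens MAX_VOICES times per buffer rather than MAX_VOICES * inNumberFrames times, and the inner loops are simple enough for the compiler (or a vector library) to vectorize. Whether it helps in practice depends on how much of doVoice can be restructured to work on blocks, so it is worth confirming with a profiler.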

score 1 · Accepted Answer · answered Feb 26 '15 at 04:10

Your vector call is performing 2 multiplies per sample (one per channel) at audio sample rates. Even if your sample rate were 192kHz, you're only talking about 384000 multiplies per second, which is not really enough to register on a modern CPU. Moreover, you're just moving existing multiplies to another place. If you had a look at the generated assembly, I would guess that the compiler optimized your original code pretty decently, and any speedup in the vDSP call is going to be offset by the fact that you now require a second pass over the buffers.

Another important thing to note is that the vDSP functions generally work better when the vector data is aligned on a 16-byte boundary. If you take a look at the SSE2 instruction set (which I'm sure vDSP uses heavily), you'll see that many instructions have one version for aligned data and another for unaligned data.

The way you would align data in gcc is something like this (note that the attribute goes before the initializer):

float inVector[8] __attribute__ ((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};

Or, if you're allocating on the heap, check whether an aligned allocator is available, such as posix_memalign on POSIX systems or _aligned_malloc with MSVC.
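
As a rough sketch of the heap case, assuming a POSIX system where posix_memalign is available, something like the following would work. The helper names isAligned16 and allocAlignedFloats are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// True when p sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Returns a heap buffer of `count` floats on a 16-byte boundary,
// or nullptr if the allocation fails. Release it with free().
inline float* allocAlignedFloats(std::size_t count) {
    assert(count > 0);
    void* p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}
```

Typical use would be float* buf = allocAlignedFloats(512); followed by free(buf) when done. For stack or static arrays in C++11 and later, alignas(16) float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8}; is the portable spelling of the gcc attribute. It is worth checking the buffers you actually pass to vDSP with something like isAligned16 before changing anything, since they may already be aligned.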
