
I'd like to know if someone has experience writing a HAL AudioUnit render callback that takes advantage of multi-core processors and/or symmetric multiprocessing.

My scenario is the following:

A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its render callback) additively synthesizes n sinusoidal partials, each with independently varying, live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested-loop method (per partial, per frame, per channel).

However, above a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while the other three cores remain idle.

Aside from the general discussion about additive synthesis being "processor expensive" compared to, say, "wavetable" synthesis, I need to know whether this can be resolved the right way, i.e. by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the render thread into sub-threads does not seem the right way, since the render callback is already a time-constrained thread in itself, and the final output has to be sample-accurate in terms of latency. Has someone had positive experience with valid methods for resolving such an issue?

System: 10.7.x

CPU: quad-core i7

Thanks in advance,

CA

user3078414

4 Answers


This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
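
For illustration, a minimal sketch of the kind of lock-free exchange described above (my own hypothetical names, using a single-producer/single-consumer ring buffer with C11 atomics, not code from any shipping product): a separate high-priority worker thread pre-renders samples into the ring, and the render callback only copies them out, zero-filling on underrun rather than blocking.

#include <stdatomic.h>
#include <AudioUnit/AudioUnit.h>

// Capacity in Float32 samples; a power of two so indices can wrap with a mask.
#define kRingCapacity 16384

typedef struct {
    Float32     samples[kRingCapacity];
    atomic_uint writeIndex;   // advanced only by the worker thread
    atomic_uint readIndex;    // advanced only by the render callback
} RingBuffer;

// Worker-thread side: copy up to 'count' samples in, return how many actually fit.
static UInt32 RingBufferWrite(RingBuffer *rb, const Float32 *src, UInt32 count)
{
    UInt32 w     = atomic_load_explicit(&rb->writeIndex, memory_order_relaxed);
    UInt32 r     = atomic_load_explicit(&rb->readIndex,  memory_order_acquire);
    UInt32 space = kRingCapacity - (w - r);
    if (count > space) count = space;
    for (UInt32 i = 0; i < count; i++)
        rb->samples[(w + i) & (kRingCapacity - 1)] = src[i];
    atomic_store_explicit(&rb->writeIndex, w + count, memory_order_release);
    return count;
}

// Render-callback side: copy samples out, zero-fill on underrun instead of blocking.
static OSStatus RingRenderProc(void                       *inRefCon,
                               AudioUnitRenderActionFlags *ioActionFlags,
                               const AudioTimeStamp       *inTimeStamp,
                               UInt32                     inBusNumber,
                               UInt32                     inNumberFrames,
                               AudioBufferList            *ioData)
{
    RingBuffer *rb  = (RingBuffer *)inRefCon;
    Float32    *out = (Float32 *)ioData->mBuffers[0].mData;

    UInt32 r     = atomic_load_explicit(&rb->readIndex,  memory_order_relaxed);
    UInt32 w     = atomic_load_explicit(&rb->writeIndex, memory_order_acquire);
    UInt32 avail = w - r;
    UInt32 n     = (inNumberFrames < avail) ? inNumberFrames : avail;

    for (UInt32 i = 0; i < n; i++)
        out[i] = rb->samples[(r + i) & (kRingCapacity - 1)];
    for (UInt32 i = n; i < inNumberFrames; i++)
        out[i] = 0.f;   // the worker thread fell behind

    atomic_store_explicit(&rb->readIndex, r + n, memory_order_release);
    return noErr;
}

The price of this arrangement is extra buffering latency: the worker thread has to stay at least one callback's worth of samples ahead of the render thread.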

Adam Bryant
  • Thank you very much. I'm well aware of the audio-thread priority policy in OS X (and I find it one of the better concepts of OS X), although I've never encountered any Apple documentation stating that the single audio thread was deliberately hard-coded to be unaware of a multi-CPU environment. Doing so would be strategically unwise. – user3078414 Mar 18 '14 at 10:12
  • I presume that a proper implementation of the audio-thread priority feature should be system-wide, not "lowest-count-CPU-limited". As far as my code is concerned, there are no "larger blocks of samples" being read from a stream (as in a convolution reverb); every single Float32 sample has to be explicitly generated within a single AudioUnit callback cycle, in blocks of "inNumberFrames". All other threads are insignificant in terms of data crunching. – user3078414 Mar 18 '14 at 10:36
  • It took a few months of experimenting, learning and debugging to arrive at the point of being able to look back and accept your answer as the most useful. There are a number of follow-ups which cover the learning curve. Thanks! – user3078414 Jan 31 '16 at 19:13

Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.

If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
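
For example, a rough per-frame sketch (hypothetical names; 'phases', 'amps' and 'increments' are float arrays of length n, 'scratch' is a reusable work buffer of the same length) might look like this:

#include <Accelerate/Accelerate.h>

// Hypothetical per-frame helper: returns the summed output sample for one
// frame from n partials, using vector calls instead of a per-partial loop.
static float SumPartialsForFrame(float *phases,
                                 const float *amps,
                                 const float *increments,
                                 float *scratch,
                                 int n)
{
    float sum = 0.f;

    // scratch[i] = sinf(phases[i]) for all partials at once (vForce)
    vvsinf(scratch, phases, &n);

    // scratch[i] *= amps[i]
    vDSP_vmul(scratch, 1, amps, 1, scratch, 1, (vDSP_Length)n);

    // sum = scratch[0] + ... + scratch[n - 1]
    vDSP_sve(scratch, 1, &sum, (vDSP_Length)n);

    // advance every phase by its per-frame increment; the phases still need
    // to be wrapped back below 2*pi periodically, since vvsinf loses
    // accuracy for large arguments
    vDSP_vadd(phases, 1, increments, 1, phases, 1, (vDSP_Length)n);

    return sum;
}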

jtomschroeder
  • Thanks for your valuable answer. I had already thought of vDSP as something worth invest(igat)ing some time into at a certain development stage, although I doubt it can significantly beat my optimized code, which allows for no-latency interpolation-synthesis of up to 1024 time-varying sinusoidal partials per CPU. There are other timing issues, such as a time-varying number of partials. I'd prefer to avoid AUGraphs and CARingBuffers. Yet if vDSP can be applied inside a plain AU render callback, making the thread multi-CPU aware, it would make a difference. Any experience in this field? – user3078414 Mar 18 '14 at 21:25
  • It's tough to say how much vDSP would help without seeing the code, but basically, vectorization is faster than serial execution. vDSP can be used anywhere, including inside an AU render callback (using vDSP will not require you to use AUGraphs or CARingBuffer). The `ioData` parameter of `AURenderCallback` can be treated as a vector, and so can the *n* sinusoidal partials. Apply any operations you need using vector-style functions: fill it, sum it, apply a scalar amplitude, or a vector of amplitudes, etc. – jtomschroeder Mar 18 '14 at 21:48

Sorry for replying to my own question; I don't know how else to add some relevant information. Edit doesn't seem to work, and a comment is way too short. First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.

This would work perfectly for so-called overlap/add resynthesis based on the IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and the last value stored into the control struct for further interpolation. The direction of time and the time stretch depend on live input. Not all partials exist all the time, and the placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing the data in a way that minimizes the number of math operations...

If someone could point me at an example of positive practice, I'd be very grateful.

// Here's the simplified code snippet:

OSStatus AdditiveRenderProc(void                        *inRefCon,
                            AudioUnitRenderActionFlags  *ioActionFlags,
                            const AudioTimeStamp        *inTimeStamp,
                            UInt32                      inBusNumber,
                            UInt32                      inNumberFrames,
                            AudioBufferList             *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // (some local variables are here for debugging convenience)
    // {...    ...   ...}

    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator *)inRefCon;

    // compute interpolated values for each partial's each frame
    // {deltaf[p]...    ampf[p][frame]...   ...}

    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;

    for (UInt32 frame = 0; frame < inNumberFrames; frame++)
    {
        buf[frame] = 0.f;

        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];

            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p] * frame);

            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }

    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {...   ...   ...}
        ;
    }

    return noErr;
}
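
For the record, here is a sketch of the loop-order-swapped variant that the vDSP suggestion points towards (a hypothetical helper, not my working code; it assumes a partial's phase increment can be treated as constant within one callback, which only approximates the per-frame interpolation above):

#include <Accelerate/Accelerate.h>
#include <math.h>

// Hypothetical helper: 'buf' must be zeroed by the caller; 'ramp' and 'sines'
// are scratch buffers of at least inNumberFrames floats, allocated outside
// the render callback.
static void RenderPartialsBlocked(float *buf,
                                  float *phasef,        // [p] running phase
                                  const float *incf,    // [p] phase increment per frame
                                  float * const *ampf,  // [p][frame] interpolated amplitudes
                                  vDSP_Length candidates,
                                  vDSP_Length inNumberFrames,
                                  float *ramp,
                                  float *sines)
{
    int n = (int)inNumberFrames;

    for (vDSP_Length p = 0; p < candidates; p++) {
        // ramp[frame] = phasef[p] + frame * incf[p]
        vDSP_vramp(&phasef[p], &incf[p], ramp, 1, inNumberFrames);

        // sines[frame] = sinf(ramp[frame]) for the whole block (vForce)
        vvsinf(sines, ramp, &n);

        // buf[frame] += sines[frame] * ampf[p][frame]
        vDSP_vma(sines, 1, ampf[p], 1, buf, 1, buf, 1, inNumberFrames);

        // carry the phase into the next callback, wrapped into [0, 2*pi)
        phasef[p] = fmodf(phasef[p] + incf[p] * (float)inNumberFrames,
                          (float)(2.0 * M_PI));
    }
}

Whether this beats the per-frame loop for time-varying increments is exactly what I still have to verify.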

user3078414

As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.) and they were all either a no-go or unpredictable. There is (or at least WAS... I have not checked recently) a built-in AU called 'deferred renderer', I believe, and it threads the input and output separately, but I seem to remember that there was latency involved, so that might not help.
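
If it's still around, it should be discoverable with an AudioComponentDescription along these lines (a quick sketch, not something I've re-tested recently):

#include <AudioUnit/AudioUnit.h>

// Look up Apple's built-in deferred renderer unit (a format converter).
static AudioComponent FindDeferredRenderer(void)
{
    AudioComponentDescription desc = {
        .componentType         = kAudioUnitType_FormatConverter,
        .componentSubType      = kAudioUnitSubType_DeferredRenderer,
        .componentManufacturer = kAudioUnitManufacturer_Apple,
        .componentFlags        = 0,
        .componentFlagsMask    = 0
    };
    return AudioComponentFindNext(NULL, &desc);
}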

Also, if you are testing in AULab, I believe it is set up specifically to call on a single thread only (I think that is still the case), so you might need to try another test host to see if it still chokes when the load is distributed.

Sorry I couldn't help more, but I thought those few bits of info might be helpful.

user5283101
  • Thanks for your reply, AlexKenis. Just to avoid misunderstanding: I was looking for ways of writing the **output render callback** so that the system recognizes that more than one processor core is available to complete the required number crunching, or of hard-coding it in a way that does this kind of _housekeeping_ itself. Also, as I believe I have written before, by **AU** I didn't mean the plug-in functionality and AULab testing, but the Core Audio **low-level API** instead. – user3078414 Apr 16 '14 at 11:02