Instead of using AudioClip, you could use SourceDataLine for playback. With this approach you read the audio data progressively (typically from an AudioInputStream) before writing it to the line, which exposes the data for handling. You would have to decode from bytes to PCM for each incoming line, then add the PCM from all the lines to be merged, and re-encode the sum back to bytes for your output.
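Here is a minimal sketch of the decode, sum, and re-encode steps. It assumes every input is 16-bit signed little-endian PCM with the same sample rate and channel count; the class and method names (PcmMixSketch, decode, mix) are made up for illustration, and the hard clipping is just one simple way to handle overflow.

```java
final class PcmMixSketch {

    // Decode 16-bit signed little-endian bytes into PCM sample values.
    static short[] decode(byte[] bytes) {
        short[] pcm = new short[bytes.length / 2];
        for (int i = 0; i < pcm.length; i++) {
            int lo = bytes[2 * i] & 0xFF;   // low byte, treated as unsigned
            int hi = bytes[2 * i + 1];      // high byte, keeps the sign
            pcm[i] = (short) ((hi << 8) | lo);
        }
        return pcm;
    }

    // Sum the PCM of all inputs sample by sample, clip to the 16-bit range,
    // and encode the result back to little-endian bytes.
    static byte[] mix(byte[]... lines) {
        int maxSamples = 0;
        for (byte[] line : lines) {
            maxSamples = Math.max(maxSamples, line.length / 2);
        }

        int[] sum = new int[maxSamples];    // int accumulator avoids overflow while adding
        for (byte[] line : lines) {
            short[] pcm = decode(line);
            for (int i = 0; i < pcm.length; i++) {
                sum[i] += pcm[i];
            }
        }

        byte[] out = new byte[maxSamples * 2];
        for (int i = 0; i < maxSamples; i++) {
            int s = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, sum[i]));
            out[2 * i] = (byte) s;              // low byte
            out[2 * i + 1] = (byte) (s >> 8);   // high byte
        }
        return out;
    }
}
```

In a streaming setup you would call something like mix on each buffer read from the inputs rather than on whole files. If the summed signal clips audibly, scaling each input down before adding is a common alternative to hard clipping.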
I suspect that with a little tweaking you could get AudioCue, a library I wrote, to work for you. It has an optional mixer that will handle multiple cue inputs. The inputs make use of SourceDataLine and the mixer uses the logic I described. You would have to tweak the code to output to disk. I could probably help with that if this project is still live for you.
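For the output-to-disk part, here is a rough sketch using only the standard javax.sound.sampled classes. It is not AudioCue code. It assumes the merged audio is already in a byte array as 44.1 kHz, 16-bit, stereo, signed little-endian PCM, and the names WavWriteSketch and writeWav are hypothetical.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;

final class WavWriteSketch {

    // Wrap raw PCM bytes in an AudioInputStream and write them out as a WAV file.
    static void writeWav(byte[] pcmBytes, File target) throws IOException {
        // Assumed format: 44100 Hz, 16 bits, 2 channels, signed, little-endian.
        AudioFormat format = new AudioFormat(44100f, 16, 2, true, false);
        long frames = pcmBytes.length / format.getFrameSize();

        try (AudioInputStream ais = new AudioInputStream(
                new ByteArrayInputStream(pcmBytes), format, frames)) {
            AudioSystem.write(ais, AudioFileFormat.Type.WAVE, target);
        }
    }
}
```

This holds the whole result in memory, which is fine for short clips. For long output you would want to feed AudioSystem.write from a stream that produces the mixed bytes incrementally.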