
I am managing audio capture and playback using the Java Sound API (TargetDataLine and SourceDataLine). Now suppose that in a conference environment one participant's audio queue grows larger than the jitter threshold (due to processing or network delay), and I want to fast-forward the audio bytes I have buffered for that participant so the queue becomes shorter than the threshold.

How can I fast-forward that participant's audio byte array?

I can't do it during playback, because the player thread normally just dequeues one frame from every participant's queue and mixes them for playing. The only way I can see is to dequeue more than one frame from that participant, condense them (fast-forward) into a single frame, and then mix that with the one frame dequeued from each of the other participants. Thanks in advance for any kind of help or advice.

Nafiul Alam Fuji
  • If "fast forward" means skip, then skip; if "fast forward" means something else, then do that something else – gpasch Nov 23 '21 at 14:16
  • You are using some unusual terminology; I don't know of any usage where "jitter" means the queue size! Better clean up your terminology and come back! – gpasch Nov 23 '21 at 14:23
  • By "jitter" I mean the limit beyond which I normally won't let the queue grow; WebRTC handles this dynamically. In a conference I keep an audio queue of packets for every participant, plus a mixer thread that dequeues one packet from each queue and mixes them for playing. Under some conditions a participant's packets are delayed, and his queue starts growing because I only dequeue one packet per mixing pass. So if the queue grows past the threshold (the jitter limit), I want to fast-forward his audio (dequeue and process more packets) and mix the result with the others. I hope that clears it up; please remove the downvotes! – Nafiul Alam Fuji Nov 24 '21 at 02:24

2 Answers


There are two ways to speed up playback that I know of. In one, the faster pace produces a rise in pitch; the coding for this is relatively easy. In the other, the pitch is kept constant, but it involves working with sound granules (granular synthesis) and is harder to explain.

For the situation where maintaining the same pitch is not a concern, the basic plan is as follows: instead of advancing by single frames, advance by a frame plus a small increment. For example, let's say that advancing by 1.1 frames at a time over the course of 44,000 frames is sufficient to catch you up. (That would also raise the pitch by log2(1.1) ≈ 0.14 octave, roughly a semitone and a half.)

To advance a "fractional" frame, you first have to convert the bytes of the two bracketing frames to PCM. Then, use linear interpolation to get the intermediate value. Then convert that intermediate value back to bytes for the output line.

For example, if you are advancing from frame[0] to frame["1.1"], you will need to know the PCM values for frame[1] and frame[2]. The intermediate value can be calculated as a weighted average:

value = PCM[1] * 9/10 + PCM[2] * 1/10
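
A minimal sketch of the fractional advance in Java, assuming the backlog has already been decoded into a short[] of mono 16-bit PCM (the class and names here are purely illustrative, not from any library):

class FractionalReader {
    private final short[] pcm;    // decoded PCM for the lagging participant
    private double position = 0;  // fractional read cursor, in frames
    private double step;          // 1.0 = normal speed; 1.1 = 10% faster

    FractionalReader(short[] pcm, double step) {
        this.pcm = pcm;
        this.step = step;
    }

    boolean hasNext() {
        return (int) position + 1 < pcm.length;
    }

    short next() {
        int i = (int) position;      // lower bracketing frame
        double frac = position - i;  // weight given to the upper frame
        // weighted average; e.g. frac = 0.1 gives pcm[i]*0.9 + pcm[i+1]*0.1
        double value = pcm[i] * (1.0 - frac) + pcm[i + 1] * frac;
        position += step;
        return (short) Math.round(value);
    }
}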

I think it would be good to change the amount by which you advance gradually. Take a few dozen frames to ramp the increment up, and allow time to ramp it down again when returning to normal dequeuing. If you suddenly change the rate at which you read the audio data, you can introduce a discontinuity that will be heard as a click.
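
For instance, a linear ramp on the step size could look like this (a sketch only; the ramp length and shape are tuning choices, not tested values):

// Ease the step from 1.0 up to the catch-up rate over rampFrames frames
// instead of jumping straight to it, to avoid an audible click.
static double stepFor(long framesSinceRampStart, long rampFrames, double maxStep) {
    if (framesSinceRampStart >= rampFrames) {
        return maxStep;
    }
    double t = (double) framesSinceRampStart / rampFrames; // 0..1
    return 1.0 + t * (maxStep - 1.0); // linear ramp from 1.0 to maxStep
}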

I have used this basic plan for dynamic control of playback speed, but I haven't employed it in the situation you are describing. Regulating the variable speed could be tricky if you are also trying to keep the transitions smooth.

The basic idea for using granules involves obtaining a contiguous run of PCM frames (I'm not clear what the optimum granule size is for voice; 1 to 50 ms is commonly cited when this technique is used in synthesis) and giving it a volume envelope that allows you to mix sequential granules end to end (they must overlap).

I think the envelopes for the granules use a Hann function or a Hamming window, but I'm not clear on the details, such as how the granules are overlapped so that they mix and transition smoothly. I've only dabbled, and I'm going to assume the folks at Signal Processing will be the best bet for advice on how to code this.
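
For what it's worth, the Hann window itself is simple to compute; the part I can't vouch for is the granule length and placement, so treat this as a sketch:

// Hann envelope for one granule. Overlapping adjacent granules by half
// their length makes these envelopes sum to a (nearly) constant gain,
// which is why Hann is a common choice for overlap-add schemes.
static double[] hannWindow(int granuleLength) {
    double[] w = new double[granuleLength];
    for (int n = 0; n < granuleLength; n++) {
        w[n] = 0.5 * (1 - Math.cos(2 * Math.PI * n / (granuleLength - 1)));
    }
    return w;
}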

Phil Freihofner

I found a fantastic Git repo (the sonic library, mainly intended for audio players) that does exactly what I wanted, with plenty of controls. I can feed it a whole .wav file or even chunks of audio bytes, and after processing I get sped-up playback and more. For real-time processing I simply call it on every chunk of audio bytes.
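
Roughly how I call it per chunk (a sketch based on the Java port of sonic at github.com/waywardgeek/sonic; double-check the method names against the repo before relying on them):

// Assumes 16-bit PCM; sampleRate and numChannels must match the stream.
Sonic sonic = new Sonic(sampleRate, numChannels);
sonic.setSpeed(1.5f); // 1.5x faster playback, pitch preserved

// For each dequeued chunk of the lagging participant:
sonic.writeBytesToStream(chunk, chunk.length);
byte[] out = new byte[sonic.samplesAvailable() * numChannels * 2]; // 2 bytes per sample
int got = sonic.readBytesFromStream(out, out.length);
// mix the first `got` bytes of `out` with the other participants' frames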

I also found another algorithm that detects whether an audio chunk/byte array is voice or not; depending on its result, I can simply skip playing the non-voice packets, which gives around a 1.5x speedup with less processing.

public class DTHVAD {
    public static final int INITIAL_EMIN = 100;
    public static final double INITIAL_DELTAJ = 1.0001;

    private static boolean isFirstFrame;
    private static double Emax;  // running maximum frame energy
    private static double Emin;  // running minimum (noise-floor) energy
    private static int inactiveFrameCounter; // reserved for hangover logic, unused here
    private static double Lamda;  // weight between Emax and Emin for the threshold
    private static double DeltaJ; // slow decay applied to Emin so it can recover

    static {
        initDTH();
    }

    private static void initDTH() {
        Emax = 0;
        Emin = 0;
        isFirstFrame = true;
        Lamda = 0.950; // usable range is 0.950 to 0.999
        DeltaJ = INITIAL_DELTAJ;
    }

    // Scans the buffer in 80-sample sub-frames; returns true only if
    // every sub-frame is classified as silence.
    public static boolean isAllSilence(short[] samples, int length) {
        for (int l = 0; l < length; l += 80) {
            if (!isSilence(samples, l, l + 80)) {
                return false;
            }
        }
        return true;
    }

    // Classifies samples[offset..end) as voice (false) or silence (true)
    // using an adaptive threshold between the running min and max energies.
    public static boolean isSilence(short[] samples, int offset, int end) {
        long energy = energyRMSE(samples, offset, end);

        if (isFirstFrame) {
            Emax = energy;
            Emin = INITIAL_EMIN;
            isFirstFrame = false;
        }

        if (energy > Emax) {
            Emax = energy;
        }

        if (energy < Emin) {
            Emin = ((int) energy == 0) ? INITIAL_EMIN : energy;
            DeltaJ = INITIAL_DELTAJ; // reset DeltaJ to its initial value
        } else {
            DeltaJ *= 1.0001;
        }

        long threshold = (long) ((1 - Lamda) * Emax + Lamda * Emin);
        if (Emax > 0) {
            Lamda = (Emax - Emin) / Emax; // adapt the weighting; guard against Emax == 0
        }

        boolean isSilenceR = energy <= threshold; // above the threshold counts as voice

        Emin *= DeltaJ; // let the noise floor drift slowly upward

        return isSilenceR;
    }

    // Root-mean-square energy of samples[offset..end).
    private static long energyRMSE(short[] samples, int offset, int end) {
        double cEnergy = 0;
        float reversOfN = 1f / (end - offset); // 0.0125 for 80-sample frames

        for (int i = offset; i < end; i++) {
            long step = (long) samples[i] * samples[i]; // x*x
            cEnergy += step * reversOfN;                // accumulate x*x/N
        }
        return (long) Math.sqrt(cEnergy);
    }
}

Here I convert my byte array to a short array and detect whether it is voice or non-voice with:

frame.silence = DTHVAD.isSilence(encodeShortBuffer, 0, shortLen);
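
For completeness, a sketch of the byte[]-to-short[] conversion, assuming 16-bit signed little-endian PCM (uses java.nio.ByteBuffer and java.nio.ByteOrder; adjust the byte order to match your format):

short[] encodeShortBuffer = new short[chunk.length / 2];
ByteBuffer.wrap(chunk).order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer().get(encodeShortBuffer);
int shortLen = encodeShortBuffer.length;
frame.silence = DTHVAD.isSilence(encodeShortBuffer, 0, shortLen);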

Nafiul Alam Fuji