I am multiplexing video and audio streams. The video stream comes from generated image data; the audio stream comes from an AAC file. Some audio files are longer than the total video time I set, so my strategy is to stop muxing the audio stream once its time becomes larger than the total video time (the latter I control via the number of encoded video frames).
I won't paste the whole setup code here, but it is similar to the muxing.c example from the latest FFmpeg repo. The only difference, as I said, is that I use an audio stream from a file, not from a synthetically generated encoded frame. I am pretty sure the issue is in my incorrect sync during the muxer loop. Here is what I do:
bool AudioSetup(const char* audioInFileName)
{
    AVOutputFormat* outputF = mOutputFormatContext->oformat;
    auto audioCodecId = outputF->audio_codec;

    if (audioCodecId == AV_CODEC_ID_NONE) {
        return false;
    }

    audio_codec = avcodec_find_encoder(audioCodecId);

    avformat_open_input(&mInputAudioFormatContext, audioInFileName, 0, 0);
    avformat_find_stream_info(mInputAudioFormatContext, 0);
    av_dump_format(mInputAudioFormatContext, 0, audioInFileName, 0);

    for (size_t i = 0; i < mInputAudioFormatContext->nb_streams; i++) {
        if (mInputAudioFormatContext->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_AUDIO) {
            inAudioStream = mInputAudioFormatContext->streams[i];
            AVCodecParameters* in_codecpar = inAudioStream->codecpar;

            mAudioOutStream.st = avformat_new_stream(mOutputFormatContext, NULL);
            mAudioOutStream.st->id = mOutputFormatContext->nb_streams - 1;

            AVCodecContext* c = avcodec_alloc_context3(audio_codec);
            mAudioOutStream.enc = c;
            c->sample_fmt = audio_codec->sample_fmts[0];
            avcodec_parameters_to_context(c, inAudioStream->codecpar);

            // copy params from input to output audio stream:
            avcodec_parameters_copy(mAudioOutStream.st->codecpar, inAudioStream->codecpar);

            // audio stream ticks in samples: time_base = {1, sample_rate}
            mAudioOutStream.st->time_base.num = 1;
            mAudioOutStream.st->time_base.den = c->sample_rate;
            c->time_base = mAudioOutStream.st->time_base;

            if (mOutputFormatContext->oformat->flags & AVFMT_GLOBALHEADER) {
                c->flags |= AV_CODEC_FLAG_GLOBAL_HEADER;
            }
            break;
        }
    }
    return true;
}
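To make the intent of that {1, sample_rate} time base concrete: a duration given in seconds has to be rescaled into it before it can be compared with next_pts. A minimal sketch using the members above:

// av_rescale_q(a, bq, cq) computes a * bq / cq, so this converts the 5 s
// target duration into ticks of the audio time base {1, sample_rate}:
AVRational seconds = { 1, 1 };
int64_t audioLimit = av_rescale_q(5, seconds, mAudioOutStream.enc->time_base);
// 44100 Hz -> 220500 ticks; 16000 Hz -> 80000 ticks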
bool Encode()
{
    // mux whichever stream is behind: video goes first while its pts is not ahead of audio's
    int cc = av_compare_ts(mVideoOutStream.next_pts, mVideoOutStream.enc->time_base,
                           mAudioOutStream.next_pts, mAudioOutStream.enc->time_base);

    if (mAudioOutStream.st == NULL || cc <= 0) {
        uint8_t* data = GetYUVFrame(); // returns a ready video YUV frame to work with
        int ret = 0;
        AVPacket pkt = { 0 };
        av_init_packet(&pkt);
        pkt.size = packet->dataSize; // size of the frame buffer, tracked elsewhere in my code
        pkt.data = data;

        const int64_t duration = av_rescale_q(1, mVideoOutStream.enc->time_base, mVideoOutStream.st->time_base);
        pkt.duration = duration;
        pkt.pts = mVideoOutStream.next_pts;
        pkt.dts = mVideoOutStream.next_pts;
        mVideoOutStream.next_pts += duration;

        pkt.stream_index = mVideoOutStream.st->index;
        ret = av_interleaved_write_frame(mOutputFormatContext, &pkt);
    } else if (audio_time < video_time) {
        // 5 - duration of the video in seconds
        AVRational r = { 60, 1 };
        auto cmp = av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
        if (cmp >= 0) {
            mAudioOutStream.next_pts = std::numeric_limits<int64_t>::max();
            return true; // don't mux audio anymore
        }

        AVPacket a_pkt = { 0 };
        av_init_packet(&a_pkt);

        int ret = av_read_frame(mInputAudioFormatContext, &a_pkt);
        // if the audio file is shorter than the video, stop muxing at the end of the file
        if (ret == AVERROR_EOF) {
            mAudioOutStream.next_pts = std::numeric_limits<int64_t>::max();
            return true;
        }
        a_pkt.stream_index = mAudioOutStream.st->index;
        av_packet_rescale_ts(&a_pkt, inAudioStream->time_base, mAudioOutStream.st->time_base);
        mAudioOutStream.next_pts += a_pkt.pts;

        ret = av_interleaved_write_frame(mOutputFormatContext, &a_pkt);
    }
    return false;
}
Now, the video part is flawless. But if the audio track is longer than the video duration, the total video length ends up longer by around 5% - 20%, and it is clear that audio is contributing to that, as the video frames finish exactly where they are supposed to.
The closest 'hack' I came up with is this part:
AVRational r = { 60, 1 };
auto cmp = av_compare_ts(mAudioOutStream.next_pts, mAudioOutStream.enc->time_base, 5, r);
if (cmp >= 0) {
    mAudioOutStream.next_pts = std::numeric_limits<int64_t>::max();
    return true;
}
Here I was trying to compare the next_pts of the audio stream with the total time set for the video file, which is 5 seconds. By setting r = {60, 1} I was converting those seconds by the time_base of the audio stream. At least that's what I believed I was doing. With this hack I get a very small deviation from the correct movie length when using standard AAC files, that is, a sample rate of 44100 Hz, stereo. But if I test with more problematic samples, like AAC with a sample rate of 16000 Hz, mono, then the video file gains almost a whole second in length.
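For reference (and as UPDATE 2 below confirms), av_compare_ts(ts_a, tb_a, ts_b, tb_b) compares the products ts_a*tb_a and ts_b*tb_b as exact rationals, so the second time base defines what the literal 5 means. A small worked reading of the call above:

// left operand:  next_pts * (1 / sample_rate) -> seconds of audio written so far
// right operand: 5 * (60 / 1) = 300           -> with r = {60, 1} each tick is
//                                                60 s, so the cutoff is 300 s
// with r = {1, 1} the right operand is exactly 5 seconds:
AVRational sec = { 1, 1 };
auto cmp = av_compare_ts(mAudioOutStream.next_pts,
                         mAudioOutStream.enc->time_base, 5, sec);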
I would appreciate it if someone could point out what I am doing wrong here.
Important note: I don't set a duration on any of the contexts. I control the termination of the muxing session myself, based on the video frame count. The audio input stream has a duration, of course, but it doesn't help me, as the video duration is what defines the movie length.
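For completeness, this is roughly how I drive the session; kTotalVideoFrames and mVideoFramesWritten are illustrative names, not my actual members:

// Hypothetical driver loop: the session ends when the requested number of
// video frames has been muxed, regardless of how much audio input remains.
const int64_t kTotalVideoFrames = 5 * 30; // e.g. 5 s at 30 fps
while (mVideoFramesWritten < kTotalVideoFrames) {
    Encode();
}
av_write_trailer(mOutputFormatContext);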
UPDATE:
This is the second bounty attempt.
UPDATE 2:
Actually, my {num, den} time base for the audio timestamp comparison was wrong, while {1, 1} is indeed the way to go, as explained by the answer. What was preventing it from working was a bug in this line (my bad):
mAudioOutStream.next_pts += a_pkt.pts;
Which must be:
mAudioOutStream.next_pts = a_pkt.pts;
The bug resulted in an exponential increment of the pts, which reached the end of the stream (in terms of pts) far too early and therefore terminated the audio stream much sooner than it was supposed to be.
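For anyone landing here later, this is the audio branch with both fixes applied ({1, 1} as the seconds time base, and assignment instead of accumulation), a sketch reusing the member names from the code above:

AVRational sec = { 1, 1 }; // plain seconds, per the answer
if (av_compare_ts(mAudioOutStream.next_pts,
                  mAudioOutStream.enc->time_base, 5, sec) >= 0) {
    mAudioOutStream.next_pts = std::numeric_limits<int64_t>::max();
    return true; // reached the 5 s video duration: stop muxing audio
}

AVPacket a_pkt = { 0 };
av_init_packet(&a_pkt);
if (av_read_frame(mInputAudioFormatContext, &a_pkt) == AVERROR_EOF) {
    mAudioOutStream.next_pts = std::numeric_limits<int64_t>::max();
    return true; // audio file is shorter than the video: stop here
}

a_pkt.stream_index = mAudioOutStream.st->index;
av_packet_rescale_ts(&a_pkt, inAudioStream->time_base,
                     mAudioOutStream.st->time_base);
mAudioOutStream.next_pts = a_pkt.pts; // track the pts, don't accumulate it
av_interleaved_write_frame(mOutputFormatContext, &a_pkt);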