Why is ffmpeg faster than this minimal example?

Question

I want to read the audio out of a video file as fast as possible using the libav libraries. It all works, but it seems like it could be faster.

To get a performance baseline, I ran this ffmpeg command and timed it:

time ffmpeg -threads 1 -i file -map 0:a:0 -f null -

On a test file (a 2.5 GB, 2-hour .MOV with pcm_s16be audio), this takes about 1.35 seconds on my M1 MacBook Pro.

On the other hand, this minimal C code (based on FFmpeg's "Demuxing and decoding" example) is consistently around 0.3 seconds slower.

#include <stdlib.h> /* exit, atoi */

#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>

static int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
{
    int ret = 0;

    // submit the packet to the decoder
    ret = avcodec_send_packet(dec, pkt);

    // get all the available frames from the decoder
    while (ret >= 0) {
        ret = avcodec_receive_frame(dec, frame);
        av_frame_unref(frame);
    }

    return 0;
}

int main (int argc, char **argv)
{
    int ret = 0;
    AVFormatContext *fmt_ctx = NULL;
    AVCodecContext  *dec_ctx = NULL;
    AVFrame *frame = NULL;
    AVPacket *pkt = NULL;

    if (argc != 3) {
        exit(1);
    }

    int stream_idx = atoi(argv[2]);

    /* open input file, and allocate format context */
    avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);

    /* get the stream */
    AVStream *st = fmt_ctx->streams[stream_idx];

    /* find a decoder for the stream */
    const AVCodec *dec = avcodec_find_decoder(st->codecpar->codec_id);

    /* allocate a codec context for the decoder */
    dec_ctx = avcodec_alloc_context3(dec);

    /* copy codec parameters from input stream to output codec context */
    avcodec_parameters_to_context(dec_ctx, st->codecpar);

    /* init the decoder */
    avcodec_open2(dec_ctx, dec, NULL);

    /* allocate frame and packet structs */
    frame = av_frame_alloc();
    pkt = av_packet_alloc();

    /* read frames from the specified stream */
    while (av_read_frame(fmt_ctx, pkt) >= 0) {
        if (pkt->stream_index == stream_idx)
            ret = decode_packet(dec_ctx, pkt, frame);

        av_packet_unref(pkt);
        if (ret < 0)
            break;
    }

    /* flush the decoders */
    decode_packet(dec_ctx, NULL, frame);

    return ret < 0;
}

I tried measuring parts of this program to see if it was spending a lot of time in the setup, but it isn't: at least 1.5 seconds of the runtime is spent in the loop that reads frames.
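One way to take such per-section measurements is a small wrapper around the monotonic clock. This is only a sketch, not the code actually used for the numbers in this question: now_seconds, time_section and busy_work are hypothetical names, and busy_work merely stands in for the av_read_frame loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Current reading of the monotonic clock, in seconds. Only the difference
 * between two readings is meaningful. */
static double now_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an "unused variable" warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: run fn(arg) once and return the elapsed seconds. */
static double time_section(void (*fn)(void *), void *arg)
{
    double start = now_seconds();
    fn(arg);
    return now_seconds() - start;
}

/* Stand-in workload; in the real program this would be the read loop. */
static void busy_work(void *arg)
{
    volatile long *counter = arg;
    for (long i = 0; i < 1000000; i++)
        (*counter)++;
}
```

Wrapping the setup calls and the read loop separately this way would show where the time goes without the overhead of a sampling profiler.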

So I took some flamegraph recordings (using cargo-flamegraph) and ran each a few times to make sure the timing was consistent. The profiler probably adds some overhead, since both runs were consistently slower than when run normally, but they still show the ~0.3 second delta.

# 1.812 total
time sudo flamegraph ./minimal file 1

# 1.542 total
time sudo flamegraph ffmpeg -threads 1 -i file -map 0:a:0 -f null - 2>&1

Here are the flamegraphs stacked up, scaled so that the faster one is only 85% as wide as the slower one.

The interesting thing that stands out to me is how much more time is spent on read in the minimal example than in ffmpeg.

The time spent on lseek is also a lot longer in the minimal program – it's plainly visible in that flamegraph, but in the ffmpeg flamegraph, lseek is a single pixel wide.

What's causing this discrepancy? Is ffmpeg actually doing less work than I think it is here? Is the minimal code doing something naive? Is there some buffering or other I/O optimizations that ffmpeg has enabled?

How can I shave 0.3 seconds off of the minimal example's runtime?

Did you compare the example to the actual ffmpeg sourcecode already? What differences did you find? — mashuptwice, Jul 22 '22 at 04:01
Yeah, and the actual ffmpeg source is much more complex so it's hard to be sure I haven't missed something, but in broad strokes the steps look similar. One thing that occurred to me was to try modifying ffmpeg.c to strip out everything that this command doesn't touch, to get a clearer picture. Maybe also using dtrace to figure out which functions I can prune. — Dave Ceddia, Jul 22 '22 at 18:21
Did some more digging, and I think it's because of the `-map 0:a:0` option. With that set, ffmpeg sets the `discard` property on the other streams to `AVDISCARD_ALL`, and those packets are skipped. They do get read from disk, but they never make it as far as `av_read_frame`. My current challenge is that setting `AVDISCARD_ALL` myself seems to be ignored... — Dave Ceddia, Jul 23 '22 at 00:31
Whoops, I managed to pull include files from 2 different versions of ffmpeg, which somehow compiled and ran fine, but failed to set the `discard` flag because I think it was writing to the wrong offset. With that fixed it actually works! — Dave Ceddia, Jul 23 '22 at 02:56

score 2 · Accepted Answer · answered Jul 23 '22 at 03:15

The difference is that ffmpeg, when run with the -map flag, explicitly sets the discard field to AVDISCARD_ALL on the streams that are not mapped. The packets for those streams still get read from disk, but with this flag set they are never returned by av_read_frame (with the mov demuxer, at least).

In the example code, by contrast, this while loop receives every packet from every stream, and only drops the unwanted packets after av_read_frame has already (wastefully) returned them.

/* read frames from the specified stream */
while (av_read_frame(fmt_ctx, pkt) >= 0) {
    if (pkt->stream_index == stream_idx)
        ret = decode_packet(dec_ctx, pkt, frame);

    av_packet_unref(pkt);
    if (ret < 0)
        break;
}

I changed the program to set the discard flag on the unused streams:

// ...

/* open input file, and allocate format context */
avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);

/* get the stream */
AVStream *st = fmt_ctx->streams[stream_idx];

/* discard packets from other streams */
for (unsigned int i = 0; i < fmt_ctx->nb_streams; i++) {
  fmt_ctx->streams[i]->discard = AVDISCARD_ALL;
}
st->discard = AVDISCARD_DEFAULT;

// ...

With that change in place, it gives about a 1.8x speedup on the same test file, once the cache is warmed up.

Minimal example, without discard   1.593s
ffmpeg with -map 0:a:0             1.404s
Minimal example, with discard      0.898s
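
For reference, the speedup figure is just the ratio of the timings in the table. A tiny sketch of that arithmetic (speedup is a hypothetical helper name):

```c
#include <assert.h>

/* Speedup factor: how many times faster the `after` timing is than `before`.
 * Both arguments are wall-clock times in seconds. */
static double speedup(double before_s, double after_s)
{
    assert(after_s > 0.0);
    return before_s / after_s;
}
```

With the numbers above, speedup(1.593, 0.898) is roughly 1.77 relative to the original minimal example, and speedup(1.404, 0.898) is roughly 1.56 relative to the ffmpeg command.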
