ffmpeg(-mt) and TBB

Question

I just started using the latest build of ffmpeg into which ffmpeg-mt has been merged.

However, since my application uses TBB (Intel Threading Building Blocks), the ffmpeg-mt implementation, which creates and synchronizes its own threads, does not fit well: it could block the TBB tasks that execute the decode functions, and it would thrash the cache unnecessarily.

I was looking around in pthread.c, which seems to implement the interface ffmpeg uses to enable multithreading.

My question is whether it would be possible to create a tbb.c that implements the same functions using TBB tasks instead of explicit threads.

I am not experienced with C, but my guess is that TBB (which is C++) cannot easily be compiled into ffmpeg. So maybe overriding the ffmpeg function pointers at run time would be the way to go?

I would appreciate any suggestions or comments on integrating TBB with ffmpeg's threading API.

This question makes no sense. FFmpeg's internal use of threads is irrelevant and orthogonal to your use of threads in the calling application. Also, as far as I know FFmpeg does not even use multiple threads unless you request/allow it to do so. — R.. GitHub STOP HELPING ICE, May 18 '11 at 21:12
I disagree, it is not irrelevant. I use TBB for everything else in my application, and I also call ffmpeg from it. I would like to avoid the overhead of creating additional threads and of oversubscribing the CPU with ffmpeg's threads (ffmpeg is not the only thing running). I am aware that it doesn't use multiple threads unless I tell it to; however, I would like to have multithreaded decoding while using the task scheduler. What about this makes no sense? If I give ffmpeg 4 threads, then I will oversubscribe with 4 heavy threads. — ronag, May 18 '11 at 22:01
In this case I have profiled it by running multiple ffmpeg decoders on different threads and in TBB, and it makes enough of a difference, especially as I decode an unknown number of files in parallel. So if I decode 6 files in parallel, each using 4 threads, I will oversubscribe with 24 threads, totally thrashing the cache. — ronag, May 18 '11 at 22:12
I think you need to just control the codec parameters based on the number of decoders you already have going in parallel... — R.. GitHub STOP HELPING ICE, May 18 '11 at 23:05
Sure that could work... but what if I have 8 decoders running then for the next decoder I only use 1 thread, then the 8 decoders finish running, and the last decoder is only using 1 thread until it finishes, not optimal. — ronag, May 18 '11 at 23:11
Is there a way to add/remove threads from the codec context once it's initialized? Surely you could resync and reinitialize a new context at the next keyframe, but that would be a major pain... — R.. GitHub STOP HELPING ICE, May 18 '11 at 23:16
Another approach would be to just always give each decoder context enough threads to load all the cpus, and round-robin which decoder you're running from the calling code, rather than calling them all simultaneously from different threads. Of course this is probably not as efficient as running completely independent decoding tasks on each cpu/core, but it might be "good enough" and it's very simple to do. — R.. GitHub STOP HELPING ICE, May 18 '11 at 23:17
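The arithmetic in this exchange can be sketched as below. Both function names are hypothetical and neither is part of ffmpeg or TBB; the sketch only assumes that each decoder is given a fixed thread count when it is opened.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: total threads in flight when every decoder is
// opened with the same fixed thread count.
int total_threads(int decoders, int threads_each)
{
    assert(decoders >= 0 && threads_each >= 0);
    return decoders * threads_each;
}

// Hypothetical helper: split a fixed hardware thread budget evenly across
// the decoders that are active at the moment a new one is opened.
int threads_per_decoder(int hardware_threads, int active_decoders)
{
    assert(hardware_threads > 0 && active_decoders > 0);
    return std::max(1, hardware_threads / active_decoders);
}
```

Six decoders with four threads each give `total_threads(6, 4) == 24`, the oversubscription described above. With a budget of 8 and nine decoders active, `threads_per_decoder(8, 9)` is 1, and because that count is fixed at open time the late decoder stays at one thread even after the others finish.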

ronag · Accepted Answer · 2011-05-19T14:22:28.350

So I figured out how to do it by reading through the ffmpeg code.

Basically, all you have to do is include the code below and use tbb_avcodec_open/tbb_avcodec_close instead of ffmpeg's avcodec_open/avcodec_close.

This will use TBB tasks to execute decoding in parallel.

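// Author: Robert Nagy

// Define these before any header can pull in <stdint.h>; older C++ toolchains
// only expose the C99 constant/limit macros that FFmpeg uses when they are set.
#define __STDC_CONSTANT_MACROS
#define __STDC_LIMIT_MACROS

#include "tbb_avcodec.h"

#include <functional>

#include <tbb/atomic.h>
#include <tbb/parallel_for.h>

extern "C" 
{
    #include <libavformat/avformat.h>
}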

int task_execute(AVCodecContext* s, std::function<int(void* arg, int arg_size, int jobnr, int threadnr)>&& func, void* arg, int* ret, int count, int size)
{   
    tbb::atomic<int> counter;
    counter = 0;

    // Execute s->thread_count number of tasks in parallel.
    tbb::parallel_for(0, s->thread_count, 1, [&](int threadnr) 
    {
        while(true)
        {
            int jobnr = counter++;
            if(jobnr >= count)
                break;

            int r = func(arg, size, jobnr, threadnr);
            if (ret)
                ret[jobnr] = r;
        }
    });

    return 0;
}

int thread_execute(AVCodecContext* s, int (*func)(AVCodecContext *c2, void *arg2), void* arg, int* ret, int count, int size)
{
    return task_execute(s, [&](void* arg, int arg_size, int jobnr, int threadnr) -> int
    {
        return func(s, reinterpret_cast<uint8_t*>(arg) + jobnr*size);
    }, arg, ret, count, size);
}

int thread_execute2(AVCodecContext* s, int (*func)(AVCodecContext* c2, void* arg2, int, int), void* arg, int* ret, int count)
{
    return task_execute(s, [&](void* arg, int arg_size, int jobnr, int threadnr) -> int
    {
        return func(s, arg, jobnr, threadnr);
    }, arg, ret, count, 0);
}

void thread_init(AVCodecContext* s)
{
    static const size_t MAX_THREADS = 16; // See mpegvideo.h
    static int dummy_opaque;

    s->active_thread_type = FF_THREAD_SLICE;
    s->thread_opaque      = &dummy_opaque; 
    s->execute            = thread_execute;
    s->execute2           = thread_execute2;
    s->thread_count       = MAX_THREADS; // We are using a task-scheduler, so use as many "threads/tasks" as possible.
}

void thread_free(AVCodecContext* s)
{
    s->thread_opaque = nullptr;
}

int tbb_avcodec_open(AVCodecContext* avctx, AVCodec* codec)
{
    avctx->thread_count = 1;
    if((codec->capabilities & CODEC_CAP_SLICE_THREADS) && (avctx->thread_type & FF_THREAD_SLICE))
        thread_init(avctx);
    // ff_thread_init will not be executed, since thread_opaque != nullptr or thread_count == 1.
    return avcodec_open(avctx, codec); 
}

int tbb_avcodec_close(AVCodecContext* avctx)
{
    thread_free(avctx);
    // ff_thread_free will not be executed since thread_opaque == nullptr.
    return avcodec_close(avctx); 
}
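The work-distribution idea in task_execute (a fixed number of workers that each claim the next job number from a shared atomic counter) can be tried without TBB or ffmpeg. The sketch below is a stand-in under stated assumptions: `run_jobs` is a hypothetical name, and plain `std::thread` workers replace the TBB tasks, so it shows only the counter pattern and not TBB's scheduling.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for task_execute: 'workers' plain threads (an assumption made so
// the sketch needs no TBB) claim job numbers from one shared atomic counter.
int run_jobs(int workers, int count,
             const std::function<int(int jobnr, int threadnr)>& func,
             int* ret)
{
    assert(workers > 0 && count >= 0);
    std::atomic<int> counter(0);

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (int threadnr = 0; threadnr < workers; ++threadnr)
    {
        pool.emplace_back([&, threadnr]
        {
            while (true)
            {
                // fetch_add returns the previous value, so every job number
                // in [0, count) is handed out exactly once.
                const int jobnr = counter.fetch_add(1);
                if (jobnr >= count)
                    break;

                const int r = func(jobnr, threadnr);
                if (ret)
                    ret[jobnr] = r;
            }
        });
    }

    for (std::thread& t : pool)
        t.join();
    return 0;
}
```

Because each worker keeps the same `threadnr` for its whole lifetime, `threadnr` stays below `workers`, which is the property ffmpeg relies on when it indexes per-thread arrays.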

score 2 · Answer 2 · answered May 24 '11 at 11:28

Reposting my response from the TBB forum here, for the sake of anyone on SO who may be interested.

Your code in the answer above looks good to me; a clever way to use TBB in a context that was designed with native threads in mind. I wonder if it can be made even more TBBish, so to say. I have some ideas which you can try if you have time and desire.

The following two items can be of interest if there is a desire/need to control the number of threads.

In thread_init, create a heap-allocated tbb::task_scheduler_init (TSI) object and initialize it with as many threads as desired (not necessarily MAX_THREADS). Keep the address of this object in s->thread_opaque if possible/allowed; if not, a possible solution is a global map from AVCodecContext* to the address of the corresponding task_scheduler_init.
Correspondingly, in thread_free, look up the TSI object, destroy it, and remove it from the map.
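The global-map fallback could look roughly like the sketch below. Everything in it is an assumption for illustration: `SchedulerState` stands in for the heap-allocated scheduler object, `const void*` stands in for `AVCodecContext*` so that no ffmpeg or TBB headers are needed, and the class and method names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the per-context object being kept
// (in the suggestion above, a heap-allocated tbb::task_scheduler_init).
struct SchedulerState
{
    explicit SchedulerState(int n) : thread_count(n) {}
    int thread_count;
};

// Registry keyed by the context's address and guarded by a mutex, since
// contexts may be opened and closed from different threads.
class SchedulerRegistry
{
public:
    // For thread_init: create and remember the state for ctx.
    SchedulerState* attach(const void* ctx, int thread_count)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        std::unique_ptr<SchedulerState>& slot = states_[ctx];
        slot.reset(new SchedulerState(thread_count));
        return slot.get();
    }

    // Returns nullptr if nothing is registered for ctx.
    SchedulerState* find(const void* ctx)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = states_.find(ctx);
        return it == states_.end() ? nullptr : it->second.get();
    }

    // For thread_free: destroy the state; returns true if one existed.
    bool detach(const void* ctx)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return states_.erase(ctx) == 1;
    }

private:
    std::mutex mutex_;
    std::map<const void*, std::unique_ptr<SchedulerState>> states_;
};
```

Owning the state through `std::unique_ptr` means `detach` both removes the entry and frees the object, so thread_free cannot leak it.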

Independently of the above, another potential change is in how tbb::parallel_for is called. Instead of using it merely to create enough threads, could it not be used for its direct purpose, as below?

int task_execute(AVCodecContext* s,
                 std::function<int(void*, int, int, int)>&& func,
                 void* arg, int* ret, int count, int size)
{
    tbb::atomic<int> counter;
    counter = 0;

    // Execute 'count' number of tasks in parallel.
    tbb::parallel_for(tbb::blocked_range<int>(0, count, 2),
                      [&](const tbb::blocked_range<int>& range)
    {
        int threadnr = counter++;
        for(int jobnr = range.begin(); jobnr != range.end(); ++jobnr)
        {
            int r = func(arg, size, jobnr, threadnr);
            if (ret)
                ret[jobnr] = r;
        }
        --counter;
    });

    return 0;
}

This can perform better if count is significantly greater than thread_count, because a) more parallel slack means TBB works more efficiently (which you apparently know), and b) the overhead of the centralized atomic counter is spread over more iterations. Note that I selected the grain size of 2 for blocked_range; this is because the counter is both incremented and decremented inside the loop body, and so at least two iterations per task (and correspondingly, count>=2*thread_count) are necessary to "match" your variant.
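To see what chunk sizes a grain size of 2 can produce, here is a small model of recursive range splitting. It is not TBB code: it assumes the rule documented for tbb::blocked_range, namely that a range is divisible while its size exceeds the grain size and is then split at its midpoint. `split_range` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Model of recursive range splitting: a range is halved while it holds more
// than 'grain' iterations. This mirrors, as an assumption, the divisibility
// rule documented for tbb::blocked_range (size > grainsize).
void split_range(int begin, int end, int grain, std::vector<int>& leaf_sizes)
{
    assert(grain > 0 && begin <= end);
    const int size = end - begin;
    if (size > grain)
    {
        const int middle = begin + size / 2;
        split_range(begin, middle, grain, leaf_sizes);
        split_range(middle, end, grain, leaf_sizes);
    }
    else if (size > 0)
    {
        // Not divisible any further: this is one chunk handed to the body.
        leaf_sizes.push_back(size);
    }
}
```

Under this model, 8 iterations with grain 2 split into four chunks of 2, but 3 iterations split into chunks of 1 and 2. So if the model holds, the grain size is an upper bound on a fully split chunk rather than a guaranteed minimum, and a task may see a single iteration when count is odd. In practice the default partitioner may also stop splitting earlier and hand out larger chunks.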

Thanks for the improvement! This would also enable better load-balancing through task-stealing. — ronag, May 24 '11 at 17:04
There is one problem with this code though. It does not enforce the invariant (threadnr < MAX_THREADS). MAX_THREADS = 16 is a limitation imposed by ffmpeg which uses threadnr as an index in "thread local storage", which consists of arrays with the size MAX_THREADS. — ronag, May 24 '11 at 17:18
This invariant can be enforced through `task_scheduler_init` I think. I.e. ask `default_num_threads()` and if it's greater than 16, explicitly create just 16. Another possibility is to check the obtained `threadnr`, and if greater than 16, decrement `counter`, pause for some time to reduce contention, and try getting the thread number again. — Alexey Kukanov, May 24 '11 at 19:03
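One way to make the invariant explicit is to hand out thread numbers from a fixed pool of slots instead of a counter. A counter that is incremented and decremented does not guarantee distinct values on its own: if task A takes 0, task B takes 1, and A then finishes, the next task also takes 1 while B is still running. The sketch below is an assumption-laden illustration, not part of ffmpeg or TBB; `SlotPool` is a hypothetical name and `N` plays the role of MAX_THREADS.

```cpp
#include <atomic>
#include <cassert>

// Hands out slot numbers in [0, N) such that no two holders share a slot.
// N plays the role of ffmpeg's MAX_THREADS (16 in the answer above).
template <int N>
class SlotPool
{
public:
    SlotPool()
    {
        for (int i = 0; i < N; ++i)
            busy_[i].store(false);
    }

    // Returns a free slot index, or -1 if all N slots are taken.
    int try_acquire()
    {
        for (int i = 0; i < N; ++i)
        {
            bool expected = false;
            // Succeeds for exactly one caller per free slot.
            if (busy_[i].compare_exchange_strong(expected, true))
                return i;
        }
        return -1;
    }

    // Makes the slot available again; call when the task is done with it.
    void release(int slot)
    {
        assert(slot >= 0 && slot < N);
        busy_[slot].store(false);
    }

private:
    std::atomic<bool> busy_[N];
};
```

A task would call `try_acquire()` on entry, use the result as `threadnr`, and call `release()` on exit. If it gets -1, it can pause briefly and retry, in the spirit of the second suggestion in the comment above.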
