How to create an efficient multi-threaded task scheduler in C++?

Question

I'd like to create a very efficient task scheduler system in C++.

The basic idea is this:

class Task {
    public:
        virtual void run() = 0;
};

class Scheduler {
    public:
        void add(Task &task, double delayToRun);
};

Behind Scheduler, there should be a fixed-size thread pool that runs the tasks (I don't want to create a thread for each task). delayToRun means that the task doesn't get executed immediately, but delayToRun seconds later (measured from the point it was added to the Scheduler).

(delayToRun means an "at-least" value, of course. If the system is loaded, or if we ask the impossible from the Scheduler, it won't be able to handle our request. But it should do the best it can)
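The "at-least" semantics follow naturally if the delay is converted once, at add time, into an absolute deadline on a monotonic clock. A minimal sketch, assuming a helper called `deadline_after` (a hypothetical name, not part of the interface above):

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;  // monotonic: unaffected by wall-clock adjustments

// Hypothetical helper: turn "delaySeconds from now" into an absolute deadline.
// `now` is a parameter so the conversion can be checked deterministically.
inline Clock::time_point deadline_after(double delaySeconds,
                                        Clock::time_point now = Clock::now())
{
    assert(delaySeconds >= 0.0);  // negative delays are not meaningful here
    const std::chrono::duration<double> delay(delaySeconds);
    // duration_cast truncates toward zero to the clock's native tick.
    return now + std::chrono::duration_cast<Clock::duration>(delay);
}
```

Storing the deadline rather than the delay means every later comparison ("is the head task due?", "is the new task earlier than the head?") is a plain `time_point` comparison.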

And here's my problem. How to implement delayToRun functionality efficiently? I'm trying to solve this problem with the use of mutexes and condition variables.

I see two ways:

With manager thread

Scheduler contains two queues: allTasksQueue and tasksReadyToRunQueue. A task gets added to allTasksQueue in Scheduler::add. There is a manager thread, which sleeps until the earliest task in allTasksQueue becomes due and then moves it to tasksReadyToRunQueue. Worker threads wait for a task to become available in tasksReadyToRunQueue.

If Scheduler::add adds a task at the front of allTasksQueue (a task whose delayToRun makes it due before the current soonest-to-run task), then the manager thread needs to be woken up so it can update its wait time.

This method can be considered inefficient, because it needs two queues, and it needs two condvar signals to make a task run (one for moving the task from allTasksQueue to tasksReadyToRunQueue, and one for signalling a worker thread to actually run the task).

Without manager thread

There is one queue in the scheduler. A task gets added to this queue in Scheduler::add. A worker thread checks the queue. If it is empty, it waits without a time constraint. If it is not empty, it waits until the soonest task is due.

If there is only one condition variable that all the worker threads wait on: this method can be considered inefficient, because if a task is added near the front of the queue (meaning that, with N worker threads, the task's index is < N), then all the worker threads need to be woken up to update the time they are waiting for.
If there is a separate condition variable for each thread, then we can control which thread to wake up, so in this case we don't need to wake up all threads (we only need to wake up the thread which has the longest waiting time, so we need to track this value). I'm currently thinking about implementing this, but working out the exact details is complex. Are there any recommendations/thoughts/documents on this method?
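For illustration, the single-condition-variable variant could be sketched roughly as follows. This is a sketch under assumptions, not code from the question: tasks are `std::function<void()>` rather than `Task&`, the class name `SingleCvScheduler` is made up, and tasks still pending at destruction are simply dropped. The `notify_all` in `add` is the cost being discussed: every idle worker wakes up just to recompute its timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SingleCvScheduler {
    using Clock = std::chrono::steady_clock;
    struct Item {
        Clock::time_point due;
        std::function<void()> fn;
        bool operator>(const Item& o) const { return due > o.due; }
    };
    // std::greater makes this a min-heap: top() is the soonest task.
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue_;
    std::mutex mtx_;
    std::condition_variable cv_;  // the one CV shared by all workers
    bool running_ = true;
    std::vector<std::thread> workers_;

    void worker() {
        std::unique_lock<std::mutex> lk(mtx_);
        while (running_) {
            if (queue_.empty()) {
                cv_.wait(lk);                         // no time constraint
            } else if (queue_.top().due <= Clock::now()) {
                std::function<void()> fn = queue_.top().fn;
                queue_.pop();
                lk.unlock();
                fn();                                 // run without holding the lock
                lk.lock();
            } else {
                const Clock::time_point due = queue_.top().due;  // copy before sleeping
                cv_.wait_until(lk, due);              // wait for the soonest task
            }
        }
    }

public:
    explicit SingleCvScheduler(unsigned threads) {
        assert(threads > 0);
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { worker(); });
    }
    ~SingleCvScheduler() {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            running_ = false;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void add(std::function<void()> fn, double delaySeconds) {
        const auto due = Clock::now() + std::chrono::duration_cast<Clock::duration>(
                                            std::chrono::duration<double>(delaySeconds));
        std::lock_guard<std::mutex> lk(mtx_);
        const bool newHead = queue_.empty() || due < queue_.top().due;
        queue_.push(Item{due, std::move(fn)});
        if (newHead) cv_.notify_all();  // every waiter must recompute its timeout
    }
};
```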

Is there any better solution for this problem? I'm trying to use standard C++ features, but I'm willing to use platform-dependent tools too (my main platform is Linux), like pthreads, or even Linux-specific tools (like futexes), if they provide a better solution.

Why don't you start the worker thread immediately, and the first thing it does is wait for `delayToRun`? — 463035818_is_not_an_ai, Sep 07 '17 at 11:17
@tobi303: I want tasks to use the thread pool, I don't want to create a thread for each task. (I've edited the question to reflect this). — geza, Sep 07 '17 at 11:20
Yes, sure you can use a thread pool, but instead of only doing the task, a thread that is assigned a task could also wait for `delayToRun` before it starts the task — 463035818_is_not_an_ai, Sep 07 '17 at 11:22
@tobi303: `delayToRun` means to wait that time from when it was added to the scheduler (I've edited my question again). — geza, Sep 07 '17 at 11:26
I am just asking out of curiosity, not really claiming that it would be "the solution", but I don't understand your objections. You add it to the scheduler, calculate the absolute time when it should start, and then it is immediately available for a thread to pick up, no different from a thread pool without the delay. The only difference would be that a thread waits until the corresponding time before it actually starts the task — 463035818_is_not_an_ai, Sep 07 '17 at 11:28
@tobi303: suppose that only one worker thread exists, and I add a task with 2 seconds of delay. Here, the worker thread starts waiting for 2 seconds. Now suppose that after adding this task, I add another task with only 1 second of delay. Now we need to decrease the time the worker thread waits, so we need to wake it up so it can go to sleep for 1 second instead of 2 seconds. If there are a lot of threads, we need to wake up all of them to update the waiting time. — geza, Sep 07 '17 at 11:32

caf · Accepted Answer · 2017-09-12T00:52:07.873

9

You can avoid both having a separate "manager" thread, and having to wake up a large number of threads when the next-to-run task changes, by using a design where a single pool thread waits for the "next to run" task (if there is one) on one condition variable, and the remaining pool threads wait indefinitely on a second condition variable.

The pool threads would execute pseudocode along these lines:

pthread_mutex_lock(&queue_lock);

while (running)
{
    if (head task is ready to run)
    {
        dequeue head task;
        if (task_thread == 1)
            pthread_cond_signal(&task_cv);
        else
            pthread_cond_signal(&queue_cv);

        pthread_mutex_unlock(&queue_lock);
        run dequeued task;
        pthread_mutex_lock(&queue_lock);
    }
    else if (!queue_empty && task_thread == 0)
    {
        task_thread = 1;
        pthread_cond_timedwait(&task_cv, &queue_lock, time head task is ready to run);
        task_thread = 0;
    }
    else
    {
        pthread_cond_wait(&queue_cv, &queue_lock);
    }
}

pthread_mutex_unlock(&queue_lock);

If you change the next task to run, then you execute:

if (task_thread == 1)
    pthread_cond_signal(&task_cv);
else
    pthread_cond_signal(&queue_cv);

with the queue_lock held.

Under this scheme, all wakeups are directed at only a single thread, there's only one priority queue of tasks, and there's no manager thread required.
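For readers who prefer standard C++ over pthreads, the pseudocode above might translate roughly as follows. This is a sketch under assumptions, not the answer's own code: the class name `TwoCvScheduler`, the use of `std::function<void()>` as the task type, and the shutdown handling (pending tasks are dropped) are all choices made for the example. `timedWaiter_` plays the role of `task_thread`, `taskCv_` of `task_cv`, and `queueCv_` of `queue_cv`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TwoCvScheduler {
    using Clock = std::chrono::steady_clock;
    struct Item {
        Clock::time_point due;
        std::function<void()> fn;
        bool operator>(const Item& o) const { return due > o.due; }
    };

    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue_;
    std::mutex mtx_;                   // queue_lock
    std::condition_variable taskCv_;   // at most one worker does a timed wait here
    std::condition_variable queueCv_;  // every other idle worker waits here
    bool timedWaiter_ = false;         // task_thread
    bool running_ = true;
    std::vector<std::thread> workers_;

    // Wake exactly one thread; the caller must hold mtx_.
    void wakeOne() {
        if (timedWaiter_) taskCv_.notify_one();
        else queueCv_.notify_one();
    }

    void worker() {
        std::unique_lock<std::mutex> lk(mtx_);
        while (running_) {
            if (!queue_.empty() && queue_.top().due <= Clock::now()) {
                std::function<void()> fn = queue_.top().fn;
                queue_.pop();
                wakeOne();   // let another thread watch the new head
                lk.unlock();
                fn();        // run the task without holding the lock
                lk.lock();
            } else if (!queue_.empty() && !timedWaiter_) {
                timedWaiter_ = true;
                const Clock::time_point due = queue_.top().due;  // copy before sleeping
                taskCv_.wait_until(lk, due);
                timedWaiter_ = false;
            } else {
                queueCv_.wait(lk);  // queue empty, or someone else watches the head
            }
        }
    }

public:
    explicit TwoCvScheduler(unsigned threads) {
        assert(threads > 0);
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { worker(); });
    }

    // Tasks that are not yet due are dropped on destruction.
    ~TwoCvScheduler() {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            running_ = false;
        }
        taskCv_.notify_all();
        queueCv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void add(std::function<void()> fn, double delaySeconds) {
        const auto due = Clock::now() + std::chrono::duration_cast<Clock::duration>(
                                            std::chrono::duration<double>(delaySeconds));
        std::lock_guard<std::mutex> lk(mtx_);
        const bool newHead = queue_.empty() || due < queue_.top().due;
        queue_.push(Item{due, std::move(fn)});
        if (newHead) wakeOne();  // only a changed head needs a wakeup
    }
};
```

A notification that arrives while no thread is actually blocked on the chosen condition variable is harmless here: every worker re-examines the queue under the lock before it waits again.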

edited Sep 12 '17 at 00:52

answered Sep 11 '17 at 06:50

Interesting idea, thanks for this! It is simpler (which is good) than my cv-for-each-thread solution, and has similar runtime performance. – geza Sep 11 '17 at 09:24
@geza: I just made a slight fix to the logic (reordering the cases, so that it waits on the `queue_cv` if the queue is empty, or another task is already waiting on the head task). – caf Sep 12 '17 at 00:54
I'll award this answer because it is the closest one solving my problem. I haven't implemented it yet, but it seems to be OK. The only critique could be that it sometimes wakes up unnecessarily a thread (it always wakes up a thread before running a task - a more sophisticated algorithm would only wake up a thread when it is necessary), but this might be fixed somehow. I'll leave the question open for a while, to encourage others to come up with even better ideas. – geza Sep 16 '17 at 18:35
I choose this variant to implement. While my "without manager thread+multiple cv" is supposedly more performant, it has a drawback that it uses all threads for timed waiting. This variant uses only one, so other threads can do other useful work instead of waiting (in the case of threadpool is a shared resource, this is a good feature to have). – geza Oct 12 '17 at 12:41

Basile Starynkevitch · Answer 2 · 2017-09-07T12:20:36.260

6

Your specification is a bit too strong:

delayToRun means that the task doesn't get executed immediately, but delayToRun seconds later

You forgot to add "at least":

The task doesn't get executed now, but at least delayToRun seconds later

The point is that if ten thousand tasks are all scheduled with a 0.1 delayToRun, they surely won't practically be able to run at the same time.

With that correction, you just maintain a queue (or agenda) of (scheduled-start-time, closure to run) pairs, keep it sorted, and start N (some fixed number) threads which atomically pop the first element of the agenda and run it.

then all the worker threads need to be woken up to update the time which they are waiting for.

No, some worker threads would be woken up.

Read about condition variables and broadcast.

You might also use POSIX timers, see timer_create(2), or the Linux-specific timer file descriptors, see timerfd_create(2).

You probably want to avoid running blocking system calls in your threads, and have some central thread managing them using some event loop (see poll(2)...); otherwise, if you have a hundred tasks running sleep(100) and one task scheduled to run in half a second, it won't run before a hundred seconds have passed.
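As a rough, Linux-only illustration of the timerfd plus poll(2) combination: the sketch below arms a one-shot timer and blocks in `poll` until it fires. The function name `wait_on_timerfd` is made up for this example, and a real event loop would poll several descriptors rather than one.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>
#include <poll.h>
#include <sys/timerfd.h>
#include <unistd.h>

// Arms a one-shot timerfd for `delayMs` milliseconds and blocks in poll(2) until it fires.
// Returns the number of expirations read from the descriptor, or -1 on error.
inline long wait_on_timerfd(long delayMs)
{
    assert(delayMs > 0);  // an all-zero it_value would disarm the timer instead
    const int fd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if (fd < 0) return -1;

    itimerspec spec{};  // it_interval stays zero: the timer fires once
    spec.it_value.tv_sec = delayMs / 1000;
    spec.it_value.tv_nsec = (delayMs % 1000) * 1000000L;
    if (timerfd_settime(fd, 0, &spec, nullptr) < 0) {
        close(fd);
        return -1;
    }

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    // A central thread would list more descriptors here (sockets, or an eventfd
    // used to announce "the head task changed"); -1 means no poll timeout.
    const int rc = poll(&pfd, 1, -1);

    std::uint64_t expirations = 0;
    bool ok = false;
    if (rc == 1 && (pfd.revents & POLLIN)) {
        // Reading a timerfd yields an 8-byte count of expirations since the last read.
        ok = read(fd, &expirations, sizeof expirations) ==
             static_cast<ssize_t>(sizeof expirations);
    }
    close(fd);
    return ok ? static_cast<long>(expirations) : -1;
}
```

Because the timer is just a file descriptor, re-arming it for an earlier deadline is a single `timerfd_settime` call from whichever thread adds the new head task.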

You may want to read about continuation-passing style (CPS) programming, which is highly relevant here. Read the paper about Continuation Passing C by Juliusz Chroboczek.

Look also into Qt threads.

You could also consider coding in Go (with its Goroutines).

edited Sep 07 '17 at 12:20

answered Sep 07 '17 at 11:32

"All 10000 threads got started at the same time like you asked. 9992 of them were immediately interrupted before they could execute a single instruction." In all seriousness "all worker threads need to be woken up" typically stands for "all worker threads are moved from the blocked queue into the ready queue". – nwp Sep 07 '17 at 11:36
I think you're wrong. All of them need to be woken up. Suppose that you have 2 worker threads. Both sleeping for 2 seconds. Now, you add a task with 1 second delay. You wake up only one, so you'll have a thread which sleeps for 1 second, and another which sleeps for 2. Now, add a long task, which has 0.1 delay. If the 1-second-delay thread runs this, then the 1-second delay task won't run in time (by the second thread, because it still waits for 2 seconds). – geza Sep 07 '17 at 11:39
And of course, if we put the Scheduler into impossible situation (like run 10'000 tasks in a single-core machine), it won't be able to do it. But it should do best it can. – geza Sep 07 '17 at 11:45
@BasileStarynkevitch: thanks for the help, but I specifically need this efficient Scheduler. It should be able to run long and very short tasks, and the task abstraction is a must (task can be anything, I've no control over it), so CPS is not a solution here. I'll check out posix timers, maybe it can help. – geza Sep 07 '17 at 11:53

LWimsey · Answer 3 · 2017-09-10T15:45:00.440

This is a sample implementation for the interface you provided that comes closest to your 'With manager thread' description.

It uses a single thread (timer_thread) to manage a queue (allTasksQueue) that is ordered by the actual time at which a task must be started (std::chrono::time_point).
The 'queue' is a std::priority_queue (which keeps the element with the earliest time_point at the top).
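To make the ordering concrete, here is a small standalone sketch. The names `Entry`, `DueLater`, `Agenda` and `drain_ids` are assumptions for the example; `DueLater` mirrors the `TScomp` comparator in the implementation below, with plain millisecond counts standing in for `time_point`.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for the answer's `key`: a due time in milliseconds plus a task id.
struct Entry {
    long dueMs;
    int id;
};

// "a has lower priority than b when it is due later": this turns
// std::priority_queue's default max-heap into a min-heap on dueMs.
struct DueLater {
    bool operator()(const Entry& a, const Entry& b) const { return a.dueMs > b.dueMs; }
};

using Agenda = std::priority_queue<Entry, std::vector<Entry>, DueLater>;

// Pops everything and returns the ids in the order a timer thread would see them.
inline std::vector<int> drain_ids(Agenda agenda)  // by value: the caller's queue is untouched
{
    std::vector<int> ids;
    long previous = agenda.empty() ? 0 : agenda.top().dueMs;
    while (!agenda.empty()) {
        const Entry e = agenda.top();  // O(1): the entry with the earliest due time
        agenda.pop();                  // O(log n)
        assert(e.dueMs >= previous);   // pops come out in non-decreasing due order
        previous = e.dueMs;
        ids.push_back(e.id);
    }
    return ids;
}
```

Only `top()` is guaranteed to be the earliest entry; the underlying vector is heap-ordered, not fully sorted, which is all a scheduler needs.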

timer_thread is normally suspended until the next task is due or a new task is added.
When a task is about to be run, it is placed in tasksReadyToRunQueue; one of the worker threads is signaled, wakes up, removes the task from the queue, and starts processing it.

Note that the thread pool has a compile-time fixed number of threads (40). If you schedule more tasks than can be dispatched to workers, new tasks will wait in tasksReadyToRunQueue until a thread becomes available again.

You said this approach is not efficient, but overall, it seems reasonably efficient to me. It's all event driven and you are not wasting CPU cycles by unnecessary spinning. Of course, it's just an example, optimizations are possible (note: std::multimap has been replaced with std::priority_queue).

The implementation is C++11 compliant.

#include <iostream>
#include <chrono>
#include <queue>
#include <vector>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <memory>

class Task {
public:
    virtual void run() = 0;
    virtual ~Task() { }
};

class Scheduler {
public:
    Scheduler();
    ~Scheduler();

    void add(Task &task, double delayToRun);

private:
    using timepoint = std::chrono::time_point<std::chrono::steady_clock>;

    struct key {
        timepoint tp;
        Task *taskp;
    };

    struct TScomp {
        bool operator()(const key &a, const key &b) const
        {
            return a.tp > b.tp;
        }
    };

    const int ThreadPoolSize = 40;

    std::vector<std::thread> ThreadPool;
    std::vector<Task *> tasksReadyToRunQueue;

    std::priority_queue<key, std::vector<key>, TScomp> allTasksQueue;

    std::thread TimerThr;
    std::mutex TimerMtx, WorkerMtx;
    std::condition_variable TimerCV, WorkerCV;

    bool WorkerIsRunning = true;
    bool TimerIsRunning = true;

    void worker_thread();
    void timer_thread();
};

Scheduler::Scheduler()
{
    for (int i = 0; i <ThreadPoolSize; ++i)
        ThreadPool.push_back(std::thread(&Scheduler::worker_thread, this));

    TimerThr = std::thread(&Scheduler::timer_thread, this);
}

Scheduler::~Scheduler()
{
    {
        std::lock_guard<std::mutex> lck{TimerMtx};
        TimerIsRunning = false;
        TimerCV.notify_one();
    }
    TimerThr.join();

    {
        std::lock_guard<std::mutex> lck{WorkerMtx};
        WorkerIsRunning = false;
        WorkerCV.notify_all();
    }
    for (auto &t : ThreadPool)
        t.join();
}

void Scheduler::add(Task &task, double delayToRun)
{
    auto now = std::chrono::steady_clock::now();
    long delay_ms = delayToRun * 1000;

    std::chrono::milliseconds duration (delay_ms);

    timepoint tp = now + duration;

    if (now >= tp)
    {
        /*
         * This is a short-cut
         * When time is due, the task is directly dispatched to the workers
         */
        std::lock_guard<std::mutex> lck{WorkerMtx};
        tasksReadyToRunQueue.push_back(&task);
        WorkerCV.notify_one();

    } else
    {
        std::lock_guard<std::mutex> lck{TimerMtx};

        allTasksQueue.push({tp, &task});

        TimerCV.notify_one();
    }
}

void Scheduler::worker_thread()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lck{WorkerMtx};

        WorkerCV.wait(lck, [this] { return tasksReadyToRunQueue.size() != 0 ||
                                           !WorkerIsRunning; } );

        if (!WorkerIsRunning)
            break;

        Task *p = tasksReadyToRunQueue.back();
        tasksReadyToRunQueue.pop_back();

        lck.unlock();

        p->run();

        delete p; // the scheduler takes ownership: tasks must be allocated with new (see main)
    }
}

void Scheduler::timer_thread()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lck{TimerMtx};

        if (!TimerIsRunning)
            break;

        // Default wait when no task is queued: wake up once per second.
        auto duration = std::chrono::nanoseconds(1000000000);

        if (allTasksQueue.size() != 0)
        {
            auto now = std::chrono::steady_clock::now();

            auto head = allTasksQueue.top();
            Task *p = head.taskp;

            duration = head.tp - now;
            if (now >= head.tp)
            {
                /*
                 * A Task is due, pass to worker threads
                 */
                std::unique_lock<std::mutex> ulck{WorkerMtx};
                tasksReadyToRunQueue.push_back(p);
                WorkerCV.notify_one();
                ulck.unlock();

                allTasksQueue.pop();
            }
        }

        TimerCV.wait_for(lck, duration);
    }
}
/*
 * End sample implementation
 */



class DemoTask : public Task {
    int n;
public:
    DemoTask(int n=0) : n{n} { }
    void run() override
    {
        std::cout << "Start task " << n << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(2));
        std::cout << " Stop task " << n << std::endl;
    }
};

int main()
{
    Scheduler sched;

    Task *t0 = new DemoTask{0};
    Task *t1 = new DemoTask{1};
    Task *t2 = new DemoTask{2};
    Task *t3 = new DemoTask{3};
    Task *t4 = new DemoTask{4};
    Task *t5 = new DemoTask{5};

    sched.add(*t0, 7.313);
    sched.add(*t1, 2.213);
    sched.add(*t2, 0.713);
    sched.add(*t3, 1.243);
    sched.add(*t4, 0.913);
    sched.add(*t5, 3.313);

    std::this_thread::sleep_for(std::chrono::seconds(10));
}

Thanks for the effort to write this code! By inefficient, I meant "I think that a more efficient algorithm is possible". And indeed, without manager thread, and using one CV for each thread seems more efficient (only one CV.notify is needed for a task to run, instead of two like in this method). For a lot of small tasks, this can be a non-negligible difference. (and you're right, there is a better container for this. Instead of multimap, one should use heap, it is better suited for this) — geza, Sep 10 '17 at 10:16
The assumption that efficiency is reduced because of a thread doing some lightweight admin work may not be justified; It is difficult to say without performance measurement. The advantage of this approach is that tasks and worker threads are separated, which improves scalability. Other solutions are possible of course. — LWimsey, Sep 10 '17 at 14:38
If there's low load on the machine, then yes, it is true. But at high loads, these lightweight operations can be expensive (maybe they call into the kernel, causes context switches, etc.). That's why I asked my question. I'd like to create a very efficient Scheduler, even if it is more complex. I'm currently implementing the "no-manager-thread-multiple-cv" variant, and as it turned out, it is not that complex as I first thought (or I have a bug in the algorithm...) — geza, Sep 10 '17 at 14:46
But maybe you're right, and it is not worth the hassle. Maybe I'll compare your implementation with mine to see the performance difference between them. — geza, Sep 10 '17 at 14:49
it's an interesting exercise anyway.. I'll see if I can come up with something you were expecting in the first place — LWimsey, Sep 10 '17 at 15:29
`std::multimap` was a bad idea.. It's using `std::priority_queue` now — LWimsey, Sep 10 '17 at 15:45
this doesn't support periodic timers yes? also any particular reason behind using priority queue and not just a queue? — Jazzy, Aug 17 '21 at 03:24
@Jazzy No periodic timers, these are one-time events. The reason for a `std::priority_queue` is to keep the events ordered by time. The one at the top is the first that will be handled. — LWimsey, Aug 26 '21 at 01:05

score 1 · Answer 4 · answered Sep 09 '17 at 19:50


In other words, you want to run all the tasks one after another in some order.

You can keep the tasks in a stack (or even a linked list) sorted by delay. When a new task arrives, insert it at the position determined by its delay (compute that position and perform the insertion efficiently).

Then run the tasks starting from the head of the stack (or list).
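A minimal sketch of such a sorted insertion, assuming a hypothetical `Pending` record and a `std::list`. Note that finding the position in a linked list is a linear scan, which is one reason the other answers use a heap instead.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical record: when the task is due (seconds on some common time base) and an id.
struct Pending {
    double dueSeconds;
    int id;
};

inline bool dueEarlier(const Pending& a, const Pending& b)
{
    return a.dueSeconds < b.dueSeconds;
}

// Inserts `p` so the list stays ordered by ascending due time.
// upper_bound places it after existing entries with the same due time,
// so equal entries keep their insertion order. The scan is O(n) on a list.
inline void insert_sorted(std::list<Pending>& tasks, const Pending& p)
{
    const auto pos = std::upper_bound(tasks.begin(), tasks.end(), p, dueEarlier);
    tasks.insert(pos, p);
    assert(std::is_sorted(tasks.begin(), tasks.end(), dueEarlier));
}
```

The head of the list (`tasks.front()`) is then always the next task to run.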

Alex Bod

This problem is much more complex, so unfortunately this answer doesn't help. – geza Sep 09 '17 at 20:04

score 1 · Answer 5 · answered Sep 14 '17 at 15:53

Core code for C++11:

#include <thread>
#include <queue>
#include <chrono>
#include <mutex>
#include <atomic>
#include <type_traits>
#include <cstdio>
using namespace std::chrono;
using namespace std;
class Task {
public:
    virtual void run() = 0;
};
template<typename T, typename = typename enable_if<std::is_base_of<Task, T>::value>::type>
class SchedulerItem {
public:
    T task;
    time_point<steady_clock> startTime;
    int delay;
    SchedulerItem(T t, time_point<steady_clock> s, int d) : task(t), startTime(s), delay(d){}
};
template<typename T, typename = typename enable_if<std::is_base_of<Task, T>::value>::type>
class Scheduler {
public:
    queue<SchedulerItem<T>> pool;
    mutex mtx;
    atomic<bool> running;
    Scheduler() : running(false){}
    void add(T task, double delayMsToRun) {
        lock_guard<mutex> lock(mtx);
        // Use steady_clock throughout: startTime is a time_point<steady_clock>,
        // and high_resolution_clock may be an alias for a different clock.
        pool.push(SchedulerItem<T>(task, steady_clock::now(), static_cast<int>(delayMsToRun)));
        if (running == false) runNext();
    }
    void runNext(void) {
        running = true;
        auto th = [this]() {
            mtx.lock();
            auto item = pool.front();
            pool.pop();
            mtx.unlock();
            auto remaining = (item.startTime + milliseconds(item.delay)) - steady_clock::now();
            if(remaining.count() > 0) this_thread::sleep_for(remaining);
            item.task.run();
            // Inspect the queue under the lock: add() may be modifying it concurrently.
            lock_guard<mutex> lock(mtx);
            if(pool.size() > 0)
                runNext();
            else
                running = false;
        };
        thread t(th);
        t.detach();
    }
};

Test code:

class MyTask : public Task {
public:
    virtual void run() override {
        printf("mytask \n");
    };
};
int main()
{
    Scheduler<MyTask> s;

    s.add(MyTask(), 0);
    s.add(MyTask(), 2000);
    s.add(MyTask(), 2500);
    s.add(MyTask(), 6000);
    std::this_thread::sleep_for(std::chrono::seconds(10));

}
