
I have a thread pool that I use to execute many tiny jobs (millions of jobs, dozens/hundreds of microseconds each). The jobs are passed in the form of either:

std::bind(&fn, arg1, arg2, arg3...)

or

[&](){fn(arg1, arg2, arg3...);}

with the thread pool taking them like this:

std::queue<std::function<void(void)>> queue;

void addJob(std::function<void(void)> fn)
{
    queue.emplace(std::move(fn));
}

Pretty standard stuff... except that I've noticed a bottleneck: if jobs execute quickly enough (less than a millisecond), the conversion from lambda/binder to std::function in the addJob function actually takes longer than execution of the jobs themselves. After doing some reading, it seems std::function is notoriously slow, so my bottleneck isn't necessarily unexpected.
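
A minimal sketch of the kind of micro-benchmark that isolates the conversion cost (fn and its arguments here are stand-ins, not my actual job code):

#include <chrono>
#include <cstdio>
#include <functional>

static void fn(int, float, void*) { /* stand-in for a tiny job */ }

int main() {
    int arg1 = 1; float arg2 = 2.0f; void* arg3 = nullptr;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        // the conversion being measured: lambda -> std::function
        std::function<void(void)> job = [&]() { fn(arg1, arg2, arg3); };
        job();
    }
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%lld ms for 1M convert+invoke\n",
        static_cast<long long>(
            std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()));
}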

Is there a faster way of doing this type of thing? I've looked into drop-in std::function replacements but they either weren't compatible with my compiler or weren't faster. I've also looked into "fast delegates" by Don Clugston but they don't seem to allow the passing of arguments along with functions (maybe I don't understand them correctly?).

I'm compiling with VS2015u3, and the functions passed to the jobs are all static, with their arguments being either ints/floats or pointers to other objects.

Tyson
  • Good old C-style function pointers are probably about as fast as you can get, although you'd need to come up with your own mechanism for binding/storing arguments to pass to the function call. Your mechanism might be faster than the one provided by std::function (since you know your program's use case and may be able to trade away flexibility for efficiency), or it may not. – Jeremy Friesner Nov 10 '17 at 19:04
  • @JeremyFriesner : negative. fnptr, virt fn and std::function<> all use similar underlying processor features, suffer from cold pages, cost tens of microsecs (2017) worst case. – lorro Nov 10 '17 at 19:13
  • Consider using std::async or TBB instead of a self-made thread pool. –  Nov 10 '17 at 19:13
  • "few milliseconds each" - is that a typo? Even the slowest implementation of std::function would be in microseconds. –  Nov 10 '17 at 19:35
  • Sorry yes, that was a typo. – Tyson Nov 10 '17 at 19:37
  • Typo in my comment as well :) – lorro Nov 10 '17 at 19:48
  • @lorro sounds like you're agreeing with me, but don't realize it ;) – Jeremy Friesner Nov 10 '17 at 20:51
  • @JeremyFriesner : OP wants a thread pool: N threads, M tasks, M >> N. The way I understand your comment is one indirect dispatch (e.g. fnptr) per M, what I'm recommending is one per N. – lorro Nov 10 '17 at 23:21
  • @lorro OP already has a thread pool, as indicated in the first sentence of the question. He's asking for a more efficient way to bind and dispatch functor objects, which is a problem that is orthogonal to thread pooling. – Jeremy Friesner Nov 10 '17 at 23:37
  • @JeremyFriesner : exactly. What I'm saying is though, you don't need fn ptrs for dispatch. It's faster to just 'loop over task types' if there are many instances of each task. – lorro Nov 10 '17 at 23:39
  • Agreed, that's a good idea (assuming the number of task-types is reasonably small, of course) – Jeremy Friesner Nov 10 '17 at 23:40
  • The thing is, depending on how much state you capture, you might trigger a heap allocation when you create the `std::function`. Allocating on the heap is way more expensive than the indirection in the function calls (probably roughly 1 microsecond vs 10 nanos). There's also the cost of potential cache misses. How much state are you capturing? There are alternative std::function implementations that will let you control how much stack storage is available; this will allow you to avoid those allocations without making major changes in your program. – Nir Friedman Nov 11 '17 at 01:37
  • @NirFriedman : your comment is a very important one, but even the indirection is not 10 nanos all the time: cold cache, etc. can bring it up to micros (we've seen I think 40 micros @ 2016 as an extreme case) – lorro Nov 11 '17 at 09:54
  • @lorro How do you get 40 micros? Accessing main memory is like 200 nanos. I'd need to see something pretty strong to buy that number. I think the solution you provided is good and faster than what I'm suggesting, I just think it is more restrictive and requires a lot more complexity in code. I'm like 80% sure that an off-the-shelf `std::function` replacement such as the one proposed here: https://github.com/WG21-SG14/SG14/blob/master/Docs/Proposals/NonAllocatingStandardFunction.pdf would solve this problem well enough with almost 0 code change. – Nir Friedman Nov 11 '17 at 21:43
  • @NirFriedman : I can't post you the code: it's IP of a firm that's been involved in high frequency trading. What goes wrong there is cold vtable, cold code segment and incorrect branch prediction emptying the pipeline. Also note that it's not avg. but worst case, which you have to prepare for if you have hard RT requirements. – lorro Nov 12 '17 at 16:22
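
The fixed-capacity idea from these comments can also be hand-rolled; a minimal sketch (move support and an empty state omitted for brevity - the SG14 paper linked above describes a complete version):

#include <cstddef>
#include <new>
#include <utility>

// Sketch of a fixed-capacity, non-allocating callable wrapper.
// A hand-rolled illustration of the idea, not the SG14 proposal itself.
template<std::size_t Capacity>
class small_task {
public:
    template<typename F>
    small_task(F f) {
        static_assert(sizeof(F) <= Capacity, "callable too large for the buffer");
        new (buf_) F(std::move(f));                           // placement-new into the in-object buffer
        invoke_  = [](void* p) { (*static_cast<F*>(p))(); };  // captureless lambda -> function pointer
        destroy_ = [](void* p) { static_cast<F*>(p)->~F(); };
    }
    small_task(const small_task&) = delete;            // move support omitted for brevity
    small_task& operator=(const small_task&) = delete;
    ~small_task() { destroy_(buf_); }

    void operator()() { invoke_(buf_); }

private:
    alignas(std::max_align_t) unsigned char buf_[Capacity];
    void (*invoke_)(void*);
    void (*destroy_)(void*);
};

// usage: small_task<64> job([&] { fn(arg1, arg2, arg3); }); job();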

1 Answer


Have a separate queue for each task type - you probably don't have tens of thousands of task types. Each queue can be, e.g., a static member of the corresponding task class. Then addJob() is actually the ctor of the Task, and it's perfectly type-safe.

Then define a compile-time list of your task types and visit it via template metaprogramming (a compile-time for_each). It'll be way faster, as you don't need any virtual call / fnptr / std::function<> to achieve this.

This will only work if the code that builds the type list sees all the Task classes (so you can't, e.g., add a new descendant of Task to an already-running executable by loading an image from disk - hope that's a non-issue).

#include <cstddef>
#include <iterator>
#include <list>

template<typename D> // CRTP on D
class Task {
public:
    // you might want to static_assert at some point that D is in TaskTypeList

    Task() : it_(tasks_.end()) {} // call enqueue() in descendant

    ~Task() {
        // add your favorite lock here
        if (queued()) {
            tasks_.erase(it_);
        }
    }

    bool queued() const { return it_ != tasks_.end(); }

    static size_t ExecNext() {
        // add your favorite lock here
        if (!tasks_.empty()) {
            D* task = tasks_.front();
            tasks_.pop_front();
            task->it_ = tasks_.end(); // mark as dequeued before running
            // release lock
            (*task)();
        }
        return tasks_.size();
    }

protected:
    void enqueue()
    {
        // add your favorite lock here
        tasks_.push_back(static_cast<D*>(this));
        it_ = std::prev(tasks_.end());
    }

private:
    typename std::list<D*>::iterator it_;

    static std::list<D*> tasks_; // you can have one per thread, too - then you don't need locking, but tasks are assigned to threads statically
};

template<typename D>
std::list<D*> Task<D>::tasks_;

struct MyTask : Task<MyTask> {
    MyTask() { enqueue(); } // call enqueue() only once the object is fully constructed
    void operator()() { /* task body goes here */ }
    // ...
};

struct MyTask2 : Task<MyTask2> { // etc. - must be a complete type before ExecNext() is instantiated
    MyTask2() { enqueue(); }
    void operator()() { /* ... */ }
};

template<typename...>
struct list_ {};

using TaskTypeList = list_<MyTask, MyTask2>;

void thread_process(list_<>) {}

template<typename TaskType, typename... TaskTypes>
void thread_process(list_<TaskType, TaskTypes...>)
{
    TaskType::ExecNext();
    thread_process(list_<TaskTypes...>());
}

void thread_process(void*)
{
    for (;;) {
        thread_process(TaskTypeList());
    }
}
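
A minimal sketch of driving this from a pool (thread count and the two task instances are placeholders; the locks noted in the comments above are still needed before tasks_ is shared across threads):

#include <thread>
#include <vector>

int main() {
    MyTask  a; // enqueues itself on construction
    MyTask2 b;

    // one worker per core, each sweeping every task-type queue in turn
    std::vector<std::thread> pool;
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([] { thread_process(nullptr); });
    for (auto& t : pool)
        t.join(); // never returns as written - thread_process(void*) loops forever
}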

There's a lot to tune in this code: different threads should start from different parts of the queue (or use a ring, or several queues with either static or dynamic assignment to threads); you'd put a thread to sleep when there are absolutely no tasks (see the sketch below); you could have an enum for the task types; etc.
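
For the sleeping part, a minimal sketch with a condition variable (the names here are placeholders; enqueue() would call notify_work() after pushing, and a worker would call wait_for_work() after a full sweep over TaskTypeList found nothing to run):

#include <condition_variable>
#include <mutex>

std::mutex              idle_m;   // placeholder names
std::condition_variable idle_cv;
bool                    work_available = false;

// worker side: block instead of spinning when the queues are empty
void wait_for_work() {
    std::unique_lock<std::mutex> lk(idle_m);
    idle_cv.wait(lk, [] { return work_available; });
    work_available = false;
}

// producer side: call right after enqueue()
void notify_work() {
    { std::lock_guard<std::mutex> lk(idle_m); work_available = true; }
    idle_cv.notify_one();
}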

Note that this can't be used with arbitrary lambdas: you need to list the task types. You also need to 'communicate' the lambda type out of the function where you declare it (e.g. by returning `std::make_pair(retval, list_)`), and sometimes that's not easy. However, you can always convert a lambda to a functor, which is straightforward - just ugly; see the sketch below.
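
For example, a capturing lambda like [x](){ fn(x); } becomes a named task type (MyTask3 is a made-up name; fn is some static function as in the question, with a hypothetical fn(int) signature):

struct MyTask3 : Task<MyTask3> {
    explicit MyTask3(int x) : x_(x) { enqueue(); } // captured state becomes a member
    void operator()() { fn(x_); }
    int x_;
};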

lorro
  • Sorry, I'm having a hard time understanding your proposed solution... how does having a per-thread queue and then storing a tuple of my threads get around the need to somehow pass them an arbitrary function with arguments (currently stored as std::function) to be executed later? If you could elaborate a bit more I would appreciate it. – Tyson Nov 10 '17 at 19:32
  • Don't pass arbitrary functions: pass MyThreadImpl classes that implement a ThreadConcept that you define (e.g. have op()) and descend from ThreadBase (this is CRTP). Then ThreadBase<> can static_cast<MyThreadImpl*>(this) and call MyThreadImpl methods, e.g. op(), w/o virtual overhead. – lorro Nov 10 '17 at 19:46
  • Hmm...I'll be honest, that's all a bit over my head at the moment...guess I've got some research to do. Cheers! – Tyson Nov 10 '17 at 20:20
  • Fix: above are meant to be Tasks of course, not Threads. About to fix in post as well. – lorro Nov 10 '17 at 23:22
  • Thanks a lot for posting the full code snippet, I now get what's going on, and more specifically what you meant by looping over task types. This is definitely the type of solution I was looking for! – Tyson Nov 11 '17 at 09:07