8

I am trying to speed up a piece of code by having background threads already set up to solve one specific task. When it is time to solve my task I would like to wake these threads up, do the job, and block them again until the next task arrives. The task is always the same.

I tried using condition variables (and the mutexes that need to go with them), but I ended up slowing my code down instead of speeding it up, mostly because the necessary calls (pthread_cond_wait / pthread_cond_signal / pthread_mutex_lock / pthread_mutex_unlock) are very expensive.

There is no point in using a thread pool (which I don't have anyway) because it is too generic a construct; here I want to address only my specific task. Depending on the implementation, I would also pay a performance penalty for the queue.

Do you have any suggestion for a quick wake-up that doesn't use mutexes or condition variables?

I was thinking of setting up the threads like timers that read an atomic variable: if the variable is set to 1 the threads do the job; if it is set to 0 they sleep for a few microseconds (I would start with microsecond sleeps, since I would like to avoid spinlocks, which might be too expensive for the CPU). What do you think about it? Any suggestion is very much appreciated.

I am using Linux, gcc, C and C++.

unwind
Abruzzo Forte e Gentile
    If your performance requirements are too extreme for the existing mutex/condition-variable approach, then you're already at the stage where you do want to burn a little CPU spinning for more work before falling back on the mutex/condition-variable. Microsecond sleeps may not work as you expect: if your process isn't de-scheduled then the CPU's not given other work anyway, and if it is your latencies may sky-rocket. – Tony Delroy Apr 08 '11 at 09:30
  • Hi Tony. I have a multicore NUMA machine. Is it true that in this case I should not have any context switching? I create the threads without any particular setting or configuration... do you think any special setting is required to avoid context switching? – Abruzzo Forte e Gentile Apr 08 '11 at 10:02
  • @Abruzzo: there are lots of factors, the dominant one being that scheduling logic's changed with Linux's kernel versions. But, in general if you tell a scheduler that you've nothing to do and it has something waiting, I wouldn't bet on it keeping you around (better chance if your delay period is clearly intra-time-slice anyway). With any serious tuning, the smart money's on implementing the alternatives and benchmarking with your actual hardware, compilers, task sizes, contention rates, data flows, kernel version etc.. – Tony Delroy Apr 08 '11 at 10:09
  • I would propose a lock-free algorithm using an internal state machine if the task suits it. – Blagovest Buyukliev Apr 08 '11 at 10:21
  • @Blagovest can you elaborate a bit? Do you mean a while loop around a variable keeping the state? I don't understand how to use them for my issue i.e. make my threads really "reactive" on some events and tell them to start doing my tasks as soon as they can. – Abruzzo Forte e Gentile Apr 08 '11 at 10:43
  • @Blagovest: in other words, pure spinning while waiting for events... if you've got CPU to burn and need minimal latency, can't beat it. – Tony Delroy Apr 08 '11 at 10:44
  • I feel like I'm starting to love this idea... just one core dedicated to this task; if I set thread affinity it will not hurt other processes running on other CPUs (provided that the affinity for those is set correctly). – Abruzzo Forte e Gentile Apr 08 '11 at 11:02
  • @Tony @Blagovest thanks a lot for your responses. Have a nice day. Best regards, AFG – Abruzzo Forte e Gentile Apr 08 '11 at 12:56
  • @Abruzzo: no worries - do drop in a note to say how it pans out for you. Cheers. – Tony Delroy Apr 08 '11 at 16:54
  • Do these tasks need to overlap in time on different cores? If not, threading buys you nothing. – Mike Dunlavey Apr 09 '11 at 17:01
  • @Mike I was thinking to spawn one extra thread per core. Why do you say it buys nothing? I thought that, being multicore, each core is really independent, having its own clock. Can you elaborate a bit? – Abruzzo Forte e Gentile Apr 11 '11 at 09:41
  • Yes each core can run in parallel with the others. But what is the nature of the tasks being performed? Can you actually get 2 or more cores performing tasks at the same time, or can you only run one task at a time, therefore only one core at a time? That's what I'm driving at. If the work is basically serial, not parallel, threading over multiple cores won't make it any faster. – Mike Dunlavey Apr 11 '11 at 12:15
  • The nature of the task is parallel. – Abruzzo Forte e Gentile Apr 11 '11 at 13:15

2 Answers

5

These functions should be fast. If they are taking a large fraction of your time, it is quite possible that you are trying to switch threads too often.

Try buffering up a work queue, and send the signal once a significant amount of work has accumulated.

If this is impossible due to dependencies between the tasks, then your application is not amenable to multithreading at all.

Potatoswatter
  • This effectively happens anyway... i.e., if the handling of the original events is not fast enough, the next few get buffered. So, seems the issue is getting better latency on the leading edges.... – Tony Delroy Apr 08 '11 at 10:13
  • @Tony: The next few don't get buffered or even produced because the main thread gets blocked. This is a strategy to reduce the total number of thread library calls for a given work-set. – Potatoswatter Apr 08 '11 at 10:27
  • @Potatoswatter: I'm just saying that a consuming thread should already be emptying the queue of all available requests before waiting for notifications of further events (i.e. not "eating" them one by one); I agree it's an issue if the app wasn't designed that way. I'm not sure, but I think the question's more about getting consistent low latency for frequent but trivially handled events - e.g. microseconds to handle, microseconds apart. Abruzzo...? – Tony Delroy Apr 08 '11 at 10:43
  • @Tony. That's right. I am not even using queues, since my task splits perfectly into something frequent but trivial that I would like to be handled fast by 3 or 4 threads. So my task can be split into 4 tasks for 4 threads; if those run/wake up fast on the "task ready" event, then I have low latency. – Abruzzo Forte e Gentile Apr 08 '11 at 11:07
  • @Tony: The latency shouldn't be affected by the `lock` time because it is called by the master thread after `signal`. That leaves `unlock`, which is fast, and `cond_signal` and the return sequence from `cond_wait`, which should be about as fast as any wakeup pathway in pthreads. Which isn't to say there's no faster way, but unless it's certain latency is the goal, it's good enough. As for one at a time, my experience is that a queue may go one at a time unless you set a minimum size, incurring excess time in `lock/unlock/signal/cond_wait`. – Potatoswatter Apr 08 '11 at 11:21
  • @Abruzzo: How trivial is the task? Is latency essential to the application, or would you rather sacrifice it for throughput? – Potatoswatter Apr 08 '11 at 11:24
0

In order to gain performance in a multithreaded application, spawn as many threads as there are CPUs, not a separate thread for each task. Otherwise you end up with a lot of overhead from context switching.

You may also consider making your algorithm more linear (i.e. by using non-blocking calls).
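For example, sizing the pool from `std::thread::hardware_concurrency()` (one way to query the CPU count; note it may legitimately return 0):

```cpp
#include <thread>
#include <vector>

// One worker per hardware thread, rather than one thread per task.
std::vector<std::thread> spawn_workers() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                       // the count may be unknown
    std::vector<std::thread> pool;
    pool.reserve(n);
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([] { /* this thread's share of the work */ });
    return pool;                             // caller joins when done
}
```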

Blagovest Buyukliev
  • The number of threads is already equal to the number of CPUs. Do you know of any setting or attribute TO ENSURE 100% that I don't have any context switching? I have a multicore machine that is also NUMA. – Abruzzo Forte e Gentile Apr 08 '11 at 09:58
  • To not have any context switching means to stop using threads :-) Consider making your algorithm non-blocking. For example, take a look at how the Nginx and Lighttpd servers are made without threads for each incoming connection. – Blagovest Buyukliev Apr 08 '11 at 10:00
  • @Abruzzo Forte e Gentile You need a realtime operating system to give such guarantees. You can come pretty close if you use the SCHED_FIFO "realtime" scheduler on Linux, in addition to pinning your threads to a particular CPU (via processor affinity); see the manpage for sched_setscheduler. If your logic can afford it, though, you can queue up items and send them in batches to workers; the throughput gain I got from queueing 10 items and firing off a pthread_cond variable vs firing it off for every item is *significant* – nos Apr 08 '11 at 10:38
  • My task has a high degree of parallelism that I want to exploit with extra threads. I can use lock-free concepts, but I need to have my threads already created, sitting in the background and waiting to start quickly. – Abruzzo Forte e Gentile Apr 08 '11 at 10:45
  • Think about having your threads do the same thing on different data, not a separate thread dedicated for a task. As for the tasks themselves, think about slicing them in little parts so you can process them asynchronously and keep their state internally. – Blagovest Buyukliev Apr 08 '11 at 11:01
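For reference, the pinning and scheduling setup mentioned in the comments above could look roughly like this on Linux (the priority value is illustrative, error handling is omitted, and SCHED_FIFO typically requires root or CAP_SYS_NICE):

```cpp
#include <pthread.h>   // compile with -pthread; g++ defines _GNU_SOURCE,
#include <sched.h>     // which pthread_setaffinity_np needs

void pin_and_prioritize(pthread_t t, int cpu) {
    // Pin the thread to one CPU so the scheduler never migrates it.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);

    // SCHED_FIFO: the thread runs until it blocks or a higher-priority
    // realtime thread preempts it -- as close to "no context switching"
    // as a stock kernel gets. Fails with EPERM without privileges.
    sched_param sp{};
    sp.sched_priority = 10;
    pthread_setschedparam(t, SCHED_FIFO, &sp);
}
```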