
I wrote a thread pool with as many threads as I have spare cores, to avoid context switching. Whenever a new task needs to be executed, it is added to a lock-free ring buffer for the threads of the pool to consume. Each time a new task is added, I currently call sem_post.
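To make the setup concrete, here is a minimal sketch of the pattern I mean (the names `ring_t`, `ring_push`, `ring_pop` are made up for illustration; this is a single-producer, multi-consumer variant, not my actual code):

```c
#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 256                     /* power of two */

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task_t;

typedef struct {
    task_t slots[RING_SIZE];
    atomic_size_t head;                   /* next slot to fill (producer) */
    atomic_size_t tail;                   /* next slot to drain (consumers) */
    sem_t ready;                          /* counts queued tasks */
} ring_t;

/* Producer side: publish the task, then wake a worker. */
int ring_push(ring_t *r, task_t t) {
    size_t h  = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tl = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - tl == RING_SIZE) return -1;   /* full */
    r->slots[h & (RING_SIZE - 1)] = t;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    sem_post(&r->ready);                  /* the call whose latency is in question */
    return 0;
}

/* Worker side: block until a task is available, then claim a slot. */
task_t ring_pop(ring_t *r) {
    sem_wait(&r->ready);                  /* guarantees a published slot exists */
    size_t tl = atomic_fetch_add_explicit(&r->tail, 1, memory_order_acq_rel);
    return r->slots[tl & (RING_SIZE - 1)];
}
```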

My benchmarks show that a call to sem_post takes 10 microseconds when there are threads waiting on the semaphore. Some calls take only 50 ns (which probably means it could be established entirely in user space that there were no threads to wake up), but 350 ± 30 ns is also a frequently seen value.

This question is about the case where one or more threads had nothing to do and are waiting on the semaphore.

I am not at all happy that in this case the caller (the thread trying to wake up another thread) spends 10 microseconds in sem_post.

Isn't there a faster way (from the viewpoint of the caller) to wake up a sleeping thread? I can live with a delay of 10 microseconds before the woken thread actually starts running, but the thread that does the waking should not be delayed that much.

Related questions that I could find (but which do not answer my question) are:

Note that a semaphore appears to be implemented on top of a futex. I'd think a futex is the fastest possible mechanism on Linux? Or would it perhaps be faster to use a signal or an interrupt?
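For what it's worth, the 50 ns fast path is consistent with a waiter-count trick: only enter the kernel when someone is actually asleep. Below is a hedged sketch of that idea using the raw futex syscall (the names `event_t`, `event_notify`, `event_wait` are mine; this is an illustration of the technique, not glibc's exact implementation). Note that it explains why some calls are cheap, but it does not make the wake itself cheaper when a waiter exists:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* Event with an explicit waiter count, so the fast path
   (no sleepers) never enters the kernel. */
typedef struct {
    atomic_uint seq;      /* bumped on every notify */
    atomic_uint waiters;  /* threads currently parked in FUTEX_WAIT */
} event_t;

static long futex(atomic_uint *uaddr, int op, unsigned val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void event_notify(event_t *e) {
    atomic_fetch_add_explicit(&e->seq, 1, memory_order_release);
    /* Only pay for the syscall if someone is actually asleep. */
    if (atomic_load_explicit(&e->waiters, memory_order_acquire) > 0)
        futex(&e->seq, FUTEX_WAKE_PRIVATE, 1);
}

void event_wait(event_t *e, unsigned seen) {
    atomic_fetch_add_explicit(&e->waiters, 1, memory_order_relaxed);
    while (atomic_load_explicit(&e->seq, memory_order_acquire) == seen)
        futex(&e->seq, FUTEX_WAIT_PRIVATE, seen);  /* EAGAIN if seq moved */
    atomic_fetch_sub_explicit(&e->waiters, 1, memory_order_relaxed);
}
```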

Acorn
Carlo Wood
  • What do you mean by avoiding context switching? Are you pinning each thread to each core, not allowing the scheduler to move them or take them out or what do you mean? – Acorn Sep 23 '19 at 15:28
  • As for the timings, Linux isn't a real-time operating system, so anything goes as they say. However, I have no clue about the soft guarantees of `sem_post` (I guess you have already taken a look at the source code for it given you say it uses a futex). If it is too slow for you, you could always do a notification yourself within user space to a dedicated thread that will run the `sem_post`s (or whatever is needed) for you (assuming you don't have so many that it can't keep up with them). – Acorn Sep 23 '19 at 15:32
  • @Acorn _Linux isn't a real-time operating system, so anything goes as they say_. However, if you do the due diligence on Linux, you may not need a hard real-time operating system at all. – Maxim Egorushkin Sep 23 '19 at 17:31
  • 10 microseconds, on which CPU? For a 40 MHz CPU that's incredibly fast (~400 cycles) and for a 4 GHz CPU it's not as fast but not necessarily bad especially if it might take a few thousand cycles just to bring a previously idle CPU out of a power saving state.. – Brendan Sep 23 '19 at 17:44
  • @Acorn When I run more threads than I have cores, the scheduler will have to switch between the threads constantly. But when I have one core for each thread (or, if threads are sleeping, even more cores than running threads), then that is not necessary. – Carlo Wood Sep 23 '19 at 18:01
  • @Acorn And how would I wake up that dedicated thread? Or are you suggesting that that dedicated thread would be spinning, watching some atomic variable? That doesn't seem an option; I'd lose a whole core with that... – Carlo Wood Sep 23 '19 at 18:04
  • @CarloWood Regarding the scheduling: yes, you are reducing context switches, but depending on the workload, syscalls, other processes in the system, etc. you may still have a lot of context switching. – Acorn Sep 23 '19 at 18:07
  • @CarloWood Regarding the dedicated thread: yes, a dedicated thread spinning as fast as you need it, looking for memory/L3 changes. You will lose a core, of course, but you get the best latency, which seems to be your problem, not throughput. – Acorn Sep 23 '19 at 18:10
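The wake-proxy pattern suggested in the comments can be sketched as follows (names `proxy_t`, `proxy_request_post`, `proxy_loop` are hypothetical). The producer pays only for one atomic increment, a few nanoseconds; a dedicated spinning thread converts pending requests into the expensive sem_post calls. This trades one core for producer-side latency, exactly the trade-off discussed above:

```c
#include <semaphore.h>
#include <stdatomic.h>
#include <pthread.h>

typedef struct {
    atomic_uint pending;   /* posts requested but not yet issued */
    atomic_int  stop;      /* set to shut the proxy down */
    sem_t      *sem;
} proxy_t;

/* Producer side: one atomic RMW, never a syscall. */
void proxy_request_post(proxy_t *p) {
    atomic_fetch_add_explicit(&p->pending, 1, memory_order_release);
}

/* Dedicated spinning thread: drains requests into sem_post.
   The ~10 µs per wake is paid here, not by the producer. */
void *proxy_loop(void *arg) {
    proxy_t *p = arg;
    while (!atomic_load_explicit(&p->stop, memory_order_relaxed)) {
        unsigned n = atomic_exchange_explicit(&p->pending, 0,
                                              memory_order_acquire);
        while (n--) sem_post(p->sem);
        /* no sched_yield() here: yielding would defeat the spin */
    }
    return NULL;
}
```

Note the caveat from the comments applies: the proxy must keep up with the request rate, and any posts requested after `stop` is set may be dropped in this simplified sketch.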

0 Answers