I have a program with about 80 threads. It runs on a machine with roughly 50 cores under Linux 3.36. At most two instances of the program run at once, and they are identical. Nothing else runs on the machine.

The threads are real-time Linux pthreads using the `SCHED_RR` (round-robin) policy.

  • 10 are highest priority (yes, I set ulimit to 99), and each one is pinned to its own core via CPU affinity.
  • about 60 are medium priority.
  • about 10 are low priority.
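For reference, a setup like the one described in the list above can be sketched as follows. This is a minimal sketch, not the original program's code: the helper name `make_rr_attr` and its parameters are hypothetical, and it only prepares a `pthread_attr_t` requesting `SCHED_RR`, a static priority, and a single-CPU affinity mask.

```c
#define _GNU_SOURCE          /* for CPU_SET and pthread_attr_setaffinity_np */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical helper (not from the original program).
 * Fills in `attr` so that a thread created with it uses SCHED_RR at static
 * priority `prio` and is pinned to `cpu`. Returns 0 on success, or the error
 * number of the first pthread call that fails.
 */
int make_rr_attr(pthread_attr_t *attr, int prio, int cpu)
{
    struct sched_param sp = { .sched_priority = prio };
    cpu_set_t cpus;
    int rc;

    assert(attr != NULL);

    if ((rc = pthread_attr_init(attr)) != 0)
        return rc;

    /* Without PTHREAD_EXPLICIT_SCHED the new thread inherits the creator's
       policy and priority, and the two settings below are ignored. */
    if ((rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED)) != 0)
        goto fail;
    if ((rc = pthread_attr_setschedpolicy(attr, SCHED_RR)) != 0)
        goto fail;
    if ((rc = pthread_attr_setschedparam(attr, &sp)) != 0)
        goto fail;

    /* Restrict the thread to exactly one CPU. */
    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);
    if ((rc = pthread_attr_setaffinity_np(attr, sizeof(cpus), &cpus)) != 0)
        goto fail;

    return 0;

fail:
    pthread_attr_destroy(attr);
    return rc;
}
```

The resulting attribute object would be passed to `pthread_create`. Creating a real-time thread fails with `EPERM` unless the process has `CAP_SYS_NICE` or a high enough `RLIMIT_RTPRIO`, which is what the ulimit setting mentioned above controls.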

The 10 highest-priority threads are constantly using CPU.

The rest do network I/O as well as some work on the CPU. Here's the problem: one of the low-priority threads is being starved, sometimes for over 15 seconds at a time. This specific thread is waiting on a TCP socket for some data. I know the data has been fully sent, because the server on the other end of the connection logs a timestamp after sending it. Usually the thread takes milliseconds to receive and process the data, but sporadically it does not do so until 15 seconds after the server has successfully sent it. Increasing the priority of the thread and pinning it to a CPU has eliminated the issue, but that is not a long-term solution. I would not expect this behavior in the first place - 15 seconds is a very long time.

Does anyone know why this would be happening? We have ruled out that it is any of the logic in the program/threads. Also note that the program is written in C.

rvishy1
  • You might (also) ask on unix.stackexchange.com. You know, if you are really sure that this is not about "your program", then your question would be actually off-topic here! – GhostCat Oct 19 '16 at 03:47

1 Answer

> I would not expect this behavior in the first place - 15 seconds is a very long time.

If your 60 medium-priority threads were all runnable, then that's exactly what you'd expect: with real-time scheduling, a lower-priority thread won't run at all while higher-priority threads that could use the same CPU are still runnable.
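To make the rule concrete, here is a toy model of how the next thread is chosen on a single CPU. It is only an illustration under simplifying assumptions, not kernel code: `struct toy_thread` and `pick_next` are made-up names, and the real scheduler keeps per-CPU run queues and migrates real-time tasks between them. The model picks the highest static priority that has a runnable thread and rotates round-robin only among threads at that priority.

```c
#include <stddef.h>

/* Toy description of a thread (hypothetical type, for illustration only). */
struct toy_thread {
    int prio;      /* static priority, higher wins (1..99 for SCHED_RR) */
    int runnable;  /* non-zero if ready to run, zero if blocked */
};

/*
 * Return the index of the thread that would run next, or -1 if none is
 * runnable. `last` is the index that ran most recently, or -1 if none has;
 * it is used to rotate among threads of equal priority.
 */
int pick_next(const struct toy_thread *t, size_t n, int last)
{
    int top = -1;
    size_t start, k;

    /* Step 1: find the highest priority that has a runnable thread. */
    for (k = 0; k < n; k++)
        if (t[k].runnable && t[k].prio > top)
            top = t[k].prio;
    if (top < 0)
        return -1;

    /* Step 2: round-robin, but only among threads at that priority. */
    start = (size_t)(last + 1) % n;
    for (k = 0; k < n; k++) {
        size_t j = (start + k) % n;
        if (t[j].runnable && t[j].prio == top)
            return (int)j;
    }
    return -1; /* not reached: step 1 found a runnable thread */
}
```

With two runnable threads at priority 50 and one at priority 10, repeated calls alternate between the first two and never return the third. The low-priority thread is chosen only once both of the others are blocked.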

You might be able to use `perf timechart` to analyse exactly what's going on.

caf
  • The policy is round-robin. Am I misunderstanding how priority matters with RR sched policy? – rvishy1 Oct 19 '16 at 22:21
  • @rvishy1: It sounds like it - see the [`sched(7)`](http://manpages.org/sched/7) man page. Threads with a higher static priority always run in preference to those with a lower static priority - *"The scheduling policy determines the ordering only within the list of runnable threads with equal static priority."* (`SCHED_RR` being the policy in this case). – caf Oct 19 '16 at 22:47