
We have several latency-sensitive "pipeline"-style programs that show a measurable performance degradation when run on one Linux kernel versus another. In particular, we see better performance with the 2.6.9 CentOS 4.x (RHEL4) kernel, and worse performance with the 2.6.18 kernel from CentOS 5.x (RHEL5).

By "pipeline" program, I mean one that has multiple threads. The mutiple threads work on shared data. Between each thread, there is a queue. So thread A gets data, pushes into Qab, thread B pulls from Qab, does some processing, then pushes into Qbc, thread C pulls from Qbc, etc. The initial data is from the network (generated by a 3rd party).

We basically measure the time from when the data is received to when the last thread performs its task. In our application, we see an increase of anywhere from 20 to 50 microseconds when moving from CentOS 4 to CentOS 5.

I have used a few methods of profiling our application, and determined that the added latency on CentOS 5 comes from queue operations (in particular, popping).

However, I can improve performance on CentOS 5 (to be the same as CentOS 4) by using taskset to bind the program to a subset of the available cores.

So it appears to me that, between CentOS 4 and 5, there was some change (presumably to the kernel) that caused threads to be scheduled differently (and this difference is suboptimal for our application).

While I can "solve" this problem with taskset (or in code via sched_setaffinity()), my preference is to not have to do this. I'm hoping there's some kind of kernel tunable (or maybe collection of tunables) whose default was changed between versions.

Anyone have any experience with this? Perhaps some more areas to investigate?

Update: In this particular case, the issue was resolved by a BIOS update from the server vendor (Dell). I pulled my hair out for quite a while on this one, until I went back to basics and checked the vendor's BIOS updates. Suspiciously, one of the updates said something like "improve performance in maximum performance mode". Once I upgraded the BIOS, CentOS 5 was faster: generally speaking, but particularly in my queue tests and actual production runs.

Matt

2 Answers


Hmm... if the time taken for a pop() operation from a producer-consumer queue is making a significant difference to the overall performance of your app, I would suggest that the structure of your threads/workflow is not optimal somewhere. Unless there is a huge amount of contention on the queues, I would be surprised if any P-C queue push/pop on any modern OS would take more than a µs or so, even if the queue uses kernel locks in a classic 'Computer Science 117 - how to make a bounded P-C queue with three semaphores' manner.
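For context, that textbook construction is roughly the following: a minimal sketch assuming a fixed-capacity ring of pointers, with a pthread mutex standing in for the third semaphore (the type and capacity are illustrative, not taken from the asker's code):

```c
#include <pthread.h>
#include <semaphore.h>

#define QCAP 1024                        /* illustrative capacity */

typedef struct {
    void           *items[QCAP];
    int             head, tail;
    sem_t           slots;               /* free slots, starts at QCAP */
    sem_t           avail;               /* filled slots, starts at 0  */
    pthread_mutex_t lock;                /* protects head/tail         */
} queue_t;

static void queue_init(queue_t *q)
{
    q->head = q->tail = 0;
    sem_init(&q->slots, 0, QCAP);
    sem_init(&q->avail, 0, 0);
    pthread_mutex_init(&q->lock, NULL);
}

static void queue_push(queue_t *q, void *item)
{
    sem_wait(&q->slots);                 /* block while the queue is full */
    pthread_mutex_lock(&q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->avail);                 /* wake a consumer */
}

static void *queue_pop(queue_t *q)
{
    void *item;
    sem_wait(&q->avail);                 /* block while the queue is empty */
    pthread_mutex_lock(&q->lock);
    item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    pthread_mutex_unlock(&q->lock);
    sem_post(&q->slots);                 /* release a slot */
    return item;
}
```

When uncontended, that push/pop is a handful of atomic operations (futex-backed sem_wait/sem_post plus the mutex), which is why I'd expect it to stay around a µs; per-op costs in the tens of microseconds point somewhere else.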

Can you just absorb the functionality of the thread(s) that do the least work into those that do the most, thereby reducing the number of push/pop operations per overall work item that flows through your system?

Martin James
  • The dequeueing operations only make a performance impact in the slower case, i.e. the newer RHEL5 kernel. Based on my experiments, my best-guess explanation for this is that the different threads are scheduled on cores in such a way as to lose cache benefits. I forgot to mention, my machine has dual quad-core CPU packages. Intuitively, if there is a shared queue between two threads scheduled across the two CPU _packages_, then performance will be abysmal. But, this is only a guess, hence the question. :) – Matt May 24 '11 at 17:27
  • I see. Now you come to mention it, I do remember once deliberately padding-out an inter-thread comms object to ensure that two instances could not sit on the same cache line. I guess things get even grimmer if two discrete packages are involved :( – Martin James May 24 '11 at 17:33
  • Yeah - I don't know how much data is in your inter-thread classes/structs/whatever, but can you create a pool of them at startup, ensuring that their sizes are, say, 4k and on a page boundary? That should reduce cache flushing because no two cores would ever have to operate on the same page of data. – Martin James May 24 '11 at 17:41
  • You could also just use `posix_memalign`, passing in the cache line size as the alignment (see the sketch after these comments) – bdonlan May 24 '11 at 17:45
  • Oh right - didn't know about 'memalign' - mostly a Windows developer & so I have to do it 'manually' - adding in extra byte buffers and accessing them once to ensure the compiler does not optimize them away :) – Martin James May 24 '11 at 17:49
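Picking up the `posix_memalign` suggestion from the comments above, a minimal sketch of cache-line-aligned allocation for the inter-thread messages might look like this (the 64-byte line size and the struct are assumptions, not taken from the actual application):

```c
#include <stdlib.h>

/* Common x86 cache-line size; on Linux/glibc it can also be queried
 * with sysconf(_SC_LEVEL1_DCACHE_LINESIZE) where that is available. */
#define CACHE_LINE 64

/* Placeholder for whatever actually flows between pipeline stages. */
typedef struct {
    char payload[CACHE_LINE];
} msg_t;

/* Allocate each message on its own cache line so unrelated messages
 * never share a line (avoids false sharing between producer/consumer). */
static msg_t *alloc_msg(void)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, sizeof(msg_t)) != 0)
        return NULL;
    return p;
}
```

This doesn't stop a line from migrating between packages when both threads genuinely touch the same message, but it does stop unrelated messages from sharing a line.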

The Linux scheduler has been an intense area of change and contention over the years. You might want to try a very recent kernel and give that a go. Yes, you may have to compile it yourself; it will be good for you. Once you have the newer kernel, you might also consider putting the different processes in different containers (with everything else in an additional one) and see if that helps.

As for other things to try, you can raise the priority of your various processes or add real-time semantics (caution: a buggy program with real-time privileges can starve the rest of the system).
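A minimal sketch of the real-time route, assuming SCHED_FIFO via sched_setscheduler() (the priority value is illustrative, and this needs root or CAP_SYS_NICE):

```c
#include <sched.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Priority 1..99 for SCHED_FIFO; 10 is an arbitrary example. */
    struct sched_param sp = { .sched_priority = 10 };

    /* pid 0 means "the calling process". */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));

    /* ... latency-sensitive work now runs under real-time scheduling ... */
    return 0;
}
```

The same experiment can be run without code changes with something like `chrt -f 10 ./your_app` (name illustrative).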

Seth Robertson
  • Without binding threads to cores (i.e. taskset/sched_setaffinity()), the latencies get _significantly_ worse on a mainline 2.6.39 kernel (from elrepo). It appears that whatever changes are being made, they are bad for our type of program. The real meat of my question is: what _are_ these changes? And, short of becoming a kernel expert, is there a way to understand scheduler changes at a conceptual level? – Matt May 30 '11 at 21:58
  • @Matt: AFAIK your best bet is to go through the Kernel Newbies change list http://kernelnewbies.org/Linux26Changes for discussions of performance and scheduler adjustments, and be prepared to test a lot of kernels. – Seth Robertson May 31 '11 at 17:55