I'll first describe my task and then present my questions below.
I am trying to implement a "one thread per connection" scheme for our distributed DAQ system. I am using Boost for threads (thread_group) and Boost.Asio for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each module has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads, each a synchronous TCP receiver, on a machine with a 1 GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computation on the incoming data. Each one only receives data, generates a timestamp, and stores the data in thread-owned memory. The sockets are all synchronous, so threads with no incoming data simply block. Sockets are kept open for the duration of a run.
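For reference, here is a stripped-down sketch of what each receiver thread does. The names (`receiver_thread`, `TimedPacket`, the port range) are placeholders rather than my actual code, but the structure matches what I described above:

```cpp
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
#include <boost/bind.hpp>
#include <vector>

namespace asio = boost::asio;
using boost::asio::ip::tcp;

struct TimedPacket {
    boost::chrono::steady_clock::time_point stamp;
    std::vector<char> payload;
};

// One thread per DAQ module: accept a single long-lived connection on a
// dedicated port, then loop doing blocking reads, timestamping each packet.
void receiver_thread(unsigned short port)
{
    asio::io_service io;                              // io_context in newer Boost
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));
    tcp::socket socket(io);
    acceptor.accept(socket);                          // one connection per port

    std::vector<TimedPacket> local_store;             // thread-owned memory, no sharing
    std::vector<char> buf(1500);                      // packets are smaller than the MTU

    for (;;) {
        boost::system::error_code ec;
        std::size_t n = socket.read_some(asio::buffer(buf), ec);  // blocks until data arrives
        if (ec) break;                                // connection closed or error: end of run

        TimedPacket p;
        p.stamp = boost::chrono::steady_clock::now(); // timestamp on arrival
        p.payload.assign(buf.begin(), buf.begin() + n);
        local_store.push_back(p);
    }
}

int main()
{
    boost::thread_group threads;
    for (unsigned short port = 20000; port < 20000 + 320; ++port)  // placeholder port range
        threads.create_thread(boost::bind(receiver_thread, port));
    threads.join_all();
}
```

The idea is that a thread blocked in `read_some` should cost nothing until its packet arrives, and should be woken and timestamp the data almost immediately.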
Our requirement is that the threads read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-sized packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets arrive less than a few microseconds apart). When the number of data packets is very small (fewer than 10), the threads' timestamps are separated by a few microseconds. However, with more than 10 packets the timestamps are spread by as much as 0.7 seconds.
My questions are:
- Have I totally misunderstood the C10K issue and messed up the implementation? 320 connections does seem trivial compared to C10K.
- Any hints as to what's going wrong?
- Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)