I'll first describe my task and then present my questions below.
I am trying to implement a "one thread per connection" scheme for our distributed DAQ system. I am using Boost for threads (thread_group) and Boost.Asio for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each module has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads, each a synchronous TCP receiver, on a machine with a 1 GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computation on the incoming data. Each one only receives data, generates a timestamp, and stores the data in thread-owned memory. The sockets are all synchronous, so threads with no incoming data simply block. Sockets are kept open for the duration of a run.
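For reference, here is a stripped-down sketch of what each receiver thread does. The names (`receiver_thread`, `TimedPacket`, the port range) are placeholders rather than my actual code, but the structure matches what I described above:

```cpp
#include <boost/asio.hpp>
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
#include <boost/bind.hpp>
#include <vector>

namespace asio = boost::asio;
using boost::asio::ip::tcp;

struct TimedPacket {
    boost::chrono::steady_clock::time_point stamp;
    std::vector<char> payload;
};

// One thread per DAQ module: accept a single long-lived connection on a
// dedicated port, then loop doing blocking reads, timestamping each packet.
void receiver_thread(unsigned short port)
{
    asio::io_service io;                              // io_context in newer Boost
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));
    tcp::socket socket(io);
    acceptor.accept(socket);                          // one connection per port

    std::vector<TimedPacket> local_store;             // thread-owned memory, no sharing
    std::vector<char> buf(1500);                      // packets are smaller than the MTU

    for (;;) {
        boost::system::error_code ec;
        std::size_t n = socket.read_some(asio::buffer(buf), ec);  // blocks until data arrives
        if (ec) break;                                // connection closed or error: end of run

        TimedPacket p;
        p.stamp = boost::chrono::steady_clock::now(); // timestamp on arrival
        p.payload.assign(buf.begin(), buf.begin() + n);
        local_store.push_back(p);
    }
}

int main()
{
    boost::thread_group threads;
    for (unsigned short port = 20000; port < 20000 + 320; ++port)  // placeholder port range
        threads.create_thread(boost::bind(receiver_thread, port));
    threads.join_all();
}
```

The idea is that a thread blocked in `read_some` should cost nothing until its packet arrives, and should be woken and timestamp the data almost immediately.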
Our requirement is that the threads read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-sized packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets arrive less than a few microseconds apart). When the number of data packets is very small (fewer than 10), the threads' timestamps are separated by a few microseconds. However, with more than 10 packets the timestamps are spread by as much as 0.7 seconds.
My questions are:
- Have I totally misunderstood the C10K issue and messed up the implementation? 320 connections does seem trivial compared to C10K.
- Any hints as to what's going wrong?
- Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)