
I acquire samples (integers) at a very high rate (several kilosamples per second) in a thread and put() them into a threading.Queue. The main thread get()s the samples one by one into a list of length 4096, msgpacks that list and finally sends it via ZeroMQ to a client. The client shows the chunks on screen (print or plot). In short, the idea is: fill the queue with single samples, but empty it in large chunks.
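
For reference, a minimal sketch of the setup described above; `read_sample()` and the endpoint are placeholders, not part of the real code:

```python
import queue
import random
import threading

import msgpack
import zmq

CHUNK = 4096
q = queue.Queue()

def read_sample():
    # hypothetical stand-in for the real acquisition call
    return random.randint(0, 1 << 15)

def sampler():
    while True:
        q.put(read_sample())                  # one put() per single sample

threading.Thread(target=sampler, daemon=True).start()

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("tcp://*:5556")                     # endpoint is an assumption

while True:
    chunk = [q.get() for _ in range(CHUNK)]   # one get() per sample: the slow part
    sock.send(msgpack.packb(chunk))           # pack the whole chunk and ship it
```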

Everything works 100% as expected, but the latter part, i.e. accessing the queue, is very slow: the queue keeps growing, and the output always lags behind by several to tens of seconds.

My question is: how can I make queue access faster? Is there a better approach?

  • Are you sure your bottleneck is the queue operations, and not the client operation? – aph Sep 20 '16 at 15:54
  • `collections.deque` is much faster than `threading.Queue` and also thread-safe, but it does not have all the features. Maybe `multiprocessing.dummy` (which actually uses threads) is worth a look for you, too. – janbrohl Sep 20 '16 at 16:51
  • You could produce complete `list`s with 4096 samples in the sampling thread and then put those lists as single items into the Queue (sketched below); this would require far fewer of the comparably slow calls to Queue methods. – janbrohl Sep 20 '16 at 16:56
  • @aph yes, I checked that by temporarily sending data directly from the sampling thread to the client. Though not totally fast, it was much faster than with the `queue`, but it comes at the cost of losing samples, of course. – xaratustra Sep 20 '16 at 19:05
  • @janbrohl: making lists in the sampling thread **greatly** improved the speed; it is nearly perfect now, thanks so much for the hint! I also changed the chunk size, and the speed seems to depend on it as well: the optimum chunk size seems to be 1024, not more, not less. I am going to check your other suggestions on `collections.deque` and `multiprocessing.dummy` to see if it can get any better. – xaratustra Sep 20 '16 at 21:01
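
A minimal sketch of janbrohl's batching suggestion, reusing the placeholder names (`q`, `CHUNK`, `read_sample`, `sock`) from the sketch above:

```python
def sampler():
    buf = []
    while True:
        buf.append(read_sample())
        if len(buf) == CHUNK:
            q.put(buf)                        # one put() per 4096 samples
            buf = []

# main thread: one get() per chunk instead of 4096 single get()s
while True:
    sock.send(msgpack.packb(q.get()))
```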

1 Answer

Q : "Is there a better approach?"

A :
Well, my ultimate performance-candidate would be this :

  • the sampler operates two or more separate, statically preallocated "circular" buffers: one is being filled in a given phase, while the other is thus free to get sent, and vice versa
  • once the sampler's filling reaches the end of the first buffer, it starts filling the next one while the first is being sent, and so on
  • a zero-copy, non-blocking .send( ..., zmq.NOBLOCK ) over an inproc:// transport class uses just memory-pointer mapping, without moving the data in RAM. The complexity can be reduced even further by moving each filled-up buffer directly from here to the client, without any mediating party (if not needed otherwise), provided a preallocated, static storage is used:
    with something like numpy.zeros( ( nBuffersInRoundRobinCYCLE, bufferSize ), dtype = np.int32 ) we can just send an already packed block of { int32 | int64 }-s, or other dtype-mapped data, via its .data buffer, round-robin cycling through the nBuffersInRoundRobinCYCLE separate in-place storage buffers. These provide sufficient latency masking: they get filled one after another in a cycle, while the previously filled ones get efficiently .send( zmq.NOBLOCK )-sent in the "background" (behind the back of the Python GIL-lock blocker tyrant) in the meantime, as needed. A sketch follows below.
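
A minimal sketch of this round-robin buffer scheme (`N_BUFFERS`, the `inproc://` endpoint name and `read_sample()` are illustrative placeholders, not a definitive implementation):

```python
import numpy as np
import zmq

BUF_SIZE  = 4096
N_BUFFERS = 4                                  # round-robin depth, an assumption

ctx  = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("inproc://samples")                  # receiver connects to the same name

# statically preallocated round-robin storage, one row per buffer
ring = np.zeros((N_BUFFERS, BUF_SIZE), dtype=np.int32)

i = 0
while True:
    buf = ring[i]
    for j in range(BUF_SIZE):                  # fill the current buffer in place
        buf[j] = read_sample()                 # hypothetical acquisition call
    try:
        # zero-copy, non-blocking hand-off of the raw int32 block
        sock.send(buf, flags=zmq.NOBLOCK, copy=False)
    except zmq.Again:
        pass                                   # receiver not ready; keep sampling
    i = (i + 1) % N_BUFFERS                    # rotate to the next free buffer
```

Note that with copy=False a buffer must not be overwritten until ZeroMQ has finished sending it; the round-robin depth is exactly what masks the in-flight sends.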

Tweaking the Python interpreter is left for the next LoD of bleeding-edge performance-boosting candidates: disabling garbage collection altogether via gc.disable(), tuning the default GIL-lock smooth-processing "meat-chopper" interval via sys.setswitchinterval() from its 5 ms default to somewhere reasonably above (as no timely thread switching is needed anymore), and moving several acquired samples in lumps of multiples of CPU words up to CPU cache-line lengths (aligned, so as to reduce the fast-cache-to-slow-RAM cache-consistency management mem-I/O updates).
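
A minimal sketch of those interpreter tweaks (the 0.5 s value is purely illustrative, and only safe once no other thread needs timely scheduling):

```python
import gc
import sys

gc.disable()                 # no garbage-collector pauses during acquisition
sys.setswitchinterval(0.5)   # raise the GIL switch interval well above the
                             # 5 ms default; fewer forced thread switch-overs
```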
