
I'm playing with pyzmq for inter-process transfer of 4k HDR image data and noticed that this:

byt = np.array2string(np.random.randn(3840,2160,3)).encode()
while True:
   socket.send(byt)

is much much faster than:

byt = np.random.randn(3840,2160,3).asbytes()
while True:
   socket.send(byt)

Can someone explain why? I can't seem to wrap my head around it.


1 Answer


Q : Why is it faster sending ... ? Can someone explain why ?

A :
+1 for having asked WHY -
people who want to understand WHY are those who strive to get to the roots of a problem, so as to truly understand the core reasons & thus be able to design better systems next time, knowing the very WHY ( taking no shortcuts in mimicking, emulating or copy/paste-following someone else )

So, let's start :

HDR is not SDR -
we will have "a lot of DATA" here to acquire - store - process - send.


Inventory of facts
- in this order : DATA, process, .send(), who gets faster & WHY

DATA :
were defined as a 4K-HDR-sized array of value triples in numpy's default dtype, whereas the ITU-R Recommendation BT.2100 HDR colourspace requires at least 10-bit storage per component for the increased colour dynamic range

The as-is code delivers numpy.random.randn( c4K, r4K, 3 )'s ( with c4K, r4K = 3840, 2160 ) default dtype of np.float64. Just for the sake of a proper & right-sized system design, HDR ( extending the plain 8-bit sRGB triple-byte colourspace ) shall always prefer an int{10|12|16|32|...}-based storage, so as not to skew any numerical image post-processing in the pipeline's later phase(s).
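
Just to illustrate the storage-footprint side of this, a minimal sketch ( the 10-bit-in-16-bit integer packing below is purely illustrative, not what the question's code does ) :

import numpy as np

c4K, r4K  = 3840, 2160
frame_f64 = np.random.randn( c4K, r4K, 3 )                              # numpy's default dtype ~ np.float64
frame_u16 = ( np.clip( frame_f64, 0, 1 ) * 1023 ).astype( np.uint16 )   # illustrative 10-bit values held in 16-bit storage

print( frame_f64.nbytes / 1E6 )   # ~199.07 [MB] per frame
print( frame_u16.nbytes / 1E6 )   #  ~49.77 [MB] per frame - a 4x smaller payload to move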

process :
Actual message-payload generating processes were defined to be
Case-A ) np.array2string( ) followed by an .encode() method

Case-B ) a numpy.ndarray-native .tobytes()-method ( the question's .asbytes() ( sic ) does not exist on numpy.ndarray )

.send() :
a ZeroMQ Scalable Formal Communication Archetype socket ( of a type not shown in the question ) finally receives the process-generated message-payload via a ( blocking form of the ) .send()-method
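
The question does not show how the socket was created; a minimal, assumed PUSH-side setup ( PUSH/PULL being just one of the archetypes that fits a one-way frame pipeline ) could look like this :

import numpy as np
import zmq

ctx    = zmq.Context()
socket = ctx.socket( zmq.PUSH )          # assumed archetype - the question does not say which one is used
socket.bind( "tcp://127.0.0.1:5555" )    # hypothetical endpoint, for illustration only

frame = np.random.randn( 3840, 2160, 3 )
socket.send( frame.tobytes() )           # a blocking-form .send() of a full, copied payload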


Solution of WHY & tips for HOW :

The core difference is hidden in the fact that we are trying to compare apples to oranges.

>>> len(                  np.random.randn( c4K, r4K, 3 ).tobytes() ) / 1E6
199.0656 [MB]

>>> len( np.array2string( np.random.randn( c4K, r4K, 3 ) )         ) / 1E6
0.001493 [MB] ... Q.E.D.

While the .tobytes()-method produces a full copy ( incl. RAM-allocation + RAM-I/O-traffic [SPACE]- and [TIME]-domain costs ), i.e. spends some extra microseconds before ZeroMQ starts its .send()-method ZeroCopy magicks :

print( np.ndarray.tobytes.__doc__ )
a.tobytes(order='C')

   Construct Python bytes containing the raw data bytes in the array.

   Constructs Python bytes showing a copy of the raw contents of
   data memory. The bytes object is produced in C-order by default.
   This behavior is controlled by the ``order`` parameter.

   .. versionadded:: 1.9.0


the other case, the Case-A, first throws away (!) a lot (!) of the original 4K-HDR DATA - how much depends on the actual numpy printoptions settings ( threshold, edgeitems, precision ) - even before moving the result into the .encode()-phase :

 >>> print( np.array2string( np.random.randn( c4K, r4K, 3 ) ) )
 [[[ 1.54482944 -0.23189048 -0.67866246]
   ...
   [ 0.13461456  1.47855833 -1.68885902]]
 
  [[-0.18963557 -1.1869201   1.34843493]
   ...
   [-0.3022641  -0.44158803  0.75750368]]
 
  [[-1.05737969  0.864752    0.36359686]
   ...
   [ 1.70240612 -0.12574642 -1.03325878]]
 
  ...
 
  [[ 0.41776933  1.73473723  0.28723299]
   ...
   [-0.47635911  0.15901325 -0.56407537]]
 
  [[-1.41571874  1.66735309  0.6259928 ]
   ...
   [-0.93164127  0.95708002  1.3470873 ]]
 
  [[ 0.16426176 -0.00317156  0.77522962]
   ...
   [ 0.32960196 -1.74369368 -0.34177759]]]
 

So, sending less DATA means spending less time moving it.
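
A quick check of the apples-to-oranges part - the Case-B payload can be restored into the original array on the receiving side, whereas the Case-A payload is just a printoptions-truncated, human-readable summary that cannot. A small sketch :

import numpy as np

frame = np.random.randn( 3840, 2160, 3 )

payload_B = frame.tobytes()                                   # Case-B : the full raw payload
restored  = np.frombuffer( payload_B, dtype=frame.dtype ).reshape( frame.shape )
print( np.array_equal( frame, restored ) )                    # True  -> a lossless serialisation of the DATA

payload_A = np.array2string( frame ).encode()                 # Case-A : a text rendering with most values elided by "..."
print( len( payload_A ), "[B] vs", len( payload_B ), "[B]" )  # ~1.5E3 [B] vs ~2E8 [B]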

Tips HOW :

  1. ZeroMQ methods & the overall performance will benefit from using the zmq.DONTWAIT flag when passing a reference to the .send()-method

  2. try to re-use as much of the great numpy-tooling as possible, to minimise repetitive RAM-allocation(s) ( we may pre-allocate a buffer once & keep re-using it )

  3. try to use as compact a DATA-representation as possible if hunting for maximum performance with minimum latency - redundancy-avoided, compact formats that match the CPU-cache-line hierarchy & associativity will always win the race for ultimate performance. Using a view of the internal numpy-storage area, i.e. without any mediating method to read-access the actual block of 4K-HDR data, may help to make the whole pipeline ZeroCopy down to the ZeroMQ .send(), which then pushes DATA-references only ( i.e. without copying or moving a single byte of DATA from / into RAM, up until loading it onto the wire ... ) - which is the coolest performance result of our design efforts here, isn't it? ( see the sketch after this list )

  4. in any case, in all critical sections, avoid the flow-blocking effects of the garbage collector by calling gc.disable(), so as to at least defer a potential .collect() from happening "here"
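
Putting these tips together, a minimal sketch ( assuming the same hypothetical PUSH socket as above, a pre-allocated uint16 frame buffer and an acquisition / processing step that fills it in place ) :

import gc
import numpy as np
import zmq

c4K, r4K = 3840, 2160

ctx    = zmq.Context()
socket = ctx.socket( zmq.PUSH )                         # assumed archetype, matching the sketch above
socket.bind( "tcp://127.0.0.1:5555" )                   # hypothetical endpoint

frame = np.empty( ( c4K, r4K, 3 ), dtype=np.uint16 )    # Tip 2 + 3 : one compact buffer, allocated once & re-used

gc.disable()                                            # Tip 4 : keep the collector out of the hot loop
try:
    while True:
        # ... fill frame in place here ( acquisition / processing ) ...
        # Tip 1 + 3 : hand ZeroMQ a reference to the existing buffer -
        #             copy=False avoids an extra Python-side copy,
        #             zmq.DONTWAIT returns instead of blocking the pipeline
        socket.send( frame, flags=zmq.DONTWAIT, copy=False )
except zmq.Again:
    pass                                                # send-queue full ( HWM reached ) - handle / retry as appropriate
finally:
    gc.enable()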
