Q : Why is it faster sending ... ? Can someone explain why ?
A :
+1 for having asked WHY -
people who understand the WHY are those who strive to learn down to the roots of a problem, so as to truly understand the core reasons, and thus can next design better systems, knowing the very WHY ( taking no shortcuts in mimicking, emulating or copy/paste-following someone else )
So, let's start :
HDR is not SDR,
so we will have "a lot of DATA" here to acquire - store - process - send.

Inventory of facts
- in this order : DATA, process, .send()
- and who gets faster & WHY
DATA :
were defined to be a 4K-HDR-sized array of triple data-values in numpy's provided default dtype, where the ITU-R Recommendation BT.2100 HDR colourspace requires at least 10 bits for the increased colour dynamic-range. The as-is code delivers numpy.random.randn( c4K, r4K, 3 ) in its default dtype of np.float64. Just for the sake of a proper & right-sized system design, HDR ( extending the plain 8-bit sRGB triple-byte colourspace ) shall always prefer int{10|12|16|32|...}-based storage, so as not to skew any numerical image post-processing in the pipeline's later phase(s).
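To illustrate the right-sizing point, a minimal sketch ( assuming the usual 3840 x 2160 geometry for the question's undefined c4K, r4K ) compares the per-frame RAM footprint of the default np.float64 against a right-sized 16-bit container :

```python
import numpy as np

c4K, r4K = 3840, 2160                                # assumed 4K geometry ( the question leaves c4K, r4K undefined )

frame_f64 = np.random.randn( c4K, r4K, 3 )           # the as-is code : default dtype np.float64
frame_u16 = np.empty( ( c4K, r4K, 3 ), np.uint16 )   # a right-sized 16-bit container for >= 10-bit HDR samples

print( frame_f64.nbytes / 1E6 )                      # 199.0656 [MB] per frame
print( frame_u16.nbytes / 1E6 )                      #  49.7664 [MB] per frame, 4x less to store / move / .send()
```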
process :
The actual message-payload generating processes were defined to be :
Case-A ) an np.array2string() call, followed by an .encode() method
Case-B ) the numpy.ndarray-native .tobytes() method ( named .asbytes() (sic) in the question )
.send() :
A ZeroMQ Scalable Formal Communication Archetype pattern ( of an unknown type ) finally receives the process-generated message-payload, passed into a ( blocking-form of the ) .send() method.
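As the archetype is unknown, a minimal end-to-end sketch ( assuming a PUSH/PULL pair over inproc:// ) shows the blocking-form .send() of a Case-B payload, and how the receiver re-materialises the ndarray without any text parsing :

```python
import numpy as np
import zmq

ctx  = zmq.Context()
push = ctx.socket( zmq.PUSH )                 # assumed archetype - the question does not say which one
pull = ctx.socket( zmq.PULL )
push.bind(    "inproc://frames" )
pull.connect( "inproc://frames" )

frame   = np.random.randn( 4, 4, 3 )          # a toy stand-in for the 4K-HDR frame
payload = frame.tobytes()                     # Case-B : raw bytes, one full copy

push.send( payload )                          # ( blocking-form of the ) .send()
data = pull.recv()

restored = np.frombuffer( data, frame.dtype ).reshape( frame.shape )
print( np.array_equal( restored, frame ) )    # True : a bit-exact round-trip
```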
Solution of WHY & tips for HOW :
The core difference is hidden in the fact that we try to compare apples to oranges.
>>> len( np.random.randn( c4K, r4K, 3 ).tobytes() ) / 1E6
199.0656 [MB]
>>> len( np.array2string( np.random.randn( c4K, r4K, 3 ) ) ) / 1E6
0.001493 [MB] ... Q.E.D.
While the .tobytes() method ( the question's (sic) .asbytes() ) produces a full copy ( incl. the RAM-allocation plus the RAM-I/O-traffic costs in both the [SPACE]- and [TIME]-domains ), i.e. spending some extra [us] before ZeroMQ starts its .send() method's ZeroCopy magic :
print( np.random.randn( c4K, r4K, 3 ).tobytes.__doc__ )
a.tobytes(order='C')
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of
data memory. The bytes object is produced in C-order by default.
This behavior is controlled by the ``order`` parameter.
.. versionadded:: 1.9.0
the other case, the Case-A, first throws away (!) a lot (!) of the original 4K-HDR DATA ( depending here on the actual numpy matrix-presentation configuration settings ) even before moving the remainder into the .encode() phase :
>>> print( np.array2string( np.random.randn( c4K, r4K, 3 ) ) )
[[[ 1.54482944 -0.23189048 -0.67866246]
...
[ 0.13461456 1.47855833 -1.68885902]]
[[-0.18963557 -1.1869201 1.34843493]
...
[-0.3022641 -0.44158803 0.75750368]]
[[-1.05737969 0.864752 0.36359686]
...
[ 1.70240612 -0.12574642 -1.03325878]]
...
[[ 0.41776933 1.73473723 0.28723299]
...
[-0.47635911 0.15901325 -0.56407537]]
[[-1.41571874 1.66735309 0.6259928 ]
...
[-0.93164127 0.95708002 1.3470873 ]]
[[ 0.16426176 -0.00317156 0.77522962]
...
[ 0.32960196 -1.74369368 -0.34177759]]]
So, sending less-DATA means taking less time to move them.
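The truncation seen above is driven by numpy's summarisation threshold; a small sketch ( using the threshold parameter of np.array2string ) shows how much of the text rendering gets thrown away by default :

```python
import numpy as np

a = np.random.randn( 100, 100, 3 )                 # 30000 elements

short = np.array2string( a )                       # default threshold ( 1000 elements ) summarises with "..."
full  = np.array2string( a, threshold = a.size )   # force a full text rendering of every element

print( "..." in short )                # True : most of the DATA was never rendered at all
print( len( full ) > len( short ) )    # True : the full rendering is vastly larger
```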
Tips HOW :
- ZeroMQ methods & the overall performance will benefit from using the zmq.DONTWAIT flag, when passing a reference to the .send() method
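A minimal sketch of the zmq.DONTWAIT flag : with no ready peer ( or a full send-queue ), the non-blocking .send() raises zmq.Again immediately, instead of silently stalling the acquisition pipeline :

```python
import zmq

ctx  = zmq.Context()
push = ctx.socket( zmq.PUSH )
push.bind( "inproc://nowait" )                      # no PULL peer connected yet

try:
    push.send( b"4K-HDR-payload", zmq.DONTWAIT )    # non-blocking attempt to queue the payload
except zmq.Again:
    # we get the control back at once, free to retry, drop or buffer the frame
    print( "send-queue not ready, yet we did not block" )
```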
- try to re-use most of the great numpy tooling, where possible, to minimise repetitive RAM-allocation(s) ( we may pre-allocate & re-use a once-allocated variable )
- try to use as compact a DATA-representation as possible, if hunting for maximum performance with minimum latency - redundancy-avoided, compact, CPU-cache-line-hierarchy- & associativity-matching formats will always win the race for ultimate performance ( using a view of the internal numpy storage area, i.e. without any mediating methods to read-access the actual block of 4K-HDR data, may help the whole pipeline become ZeroCopy down to the ZeroMQ .send(), pushing the DATA-references only, i.e. without copying or moving a single byte of DATA from / into RAM, up until loading it onto the wire ... which is the coolest performance result of our design efforts here, isn't it? )
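The ZeroCopy remark can be sketched like this : pyzmq's .send( ..., copy = False ) accepts any buffer-protocol provider, so passing the ndarray itself hands ZeroMQ just a reference to the numpy storage area and skips the extra .tobytes() copy ( again assuming a PUSH/PULL pair ) :

```python
import numpy as np
import zmq

ctx  = zmq.Context()
push = ctx.socket( zmq.PUSH )
pull = ctx.socket( zmq.PULL )
push.bind(    "inproc://zerocopy" )
pull.connect( "inproc://zerocopy" )

frame = np.random.randn( 8, 8, 3 )            # a toy stand-in for the 4K-HDR frame

push.send( frame, copy = False )              # no .tobytes() : ZeroMQ keeps a reference only
msg  = pull.recv()

view = np.frombuffer( msg, frame.dtype ).reshape( frame.shape )
print( np.array_equal( view, frame ) )        # True
```

Mind that with copy = False the array must stay unmodified until ZeroMQ has actually taken the data off ( the track = True option of .send() can report that moment ).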
- in any case, in all critical sections, avoid the effects of blocking the flow by calling gc.disable(), so as to at least defer a potential .collect() from happening "here"
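The gc remark, as a sketch : disable the collector right before the critical section and pay the deferred .collect() cost at a non-critical moment afterwards :

```python
import gc

gc.disable()                          # no automatic .collect() may fire "here"
try:
    for _ in range( 1000 ):
        pass                          # ... the latency-critical acquire / process / .send() loop ...
finally:
    gc.enable()                       # restore the normal collection policy
    gc.collect()                      # pay the deferred cost now, outside the critical section
```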