
I need to establish communication between a Scala process (JeroMQ) and a C process (ZeroMQ). The Scala process needs to send large arrays (100 million floats per array). Each array is first converted to a JSON string and, as you can see below, I am running into memory issues:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.lang.StringCoding.encode(StringCoding.java:350)
    at java.lang.String.getBytes(String.java:939)
    at org.zeromq.ZMQ$Socket.send(ZMQ.java:1276)

100 million floats correspond to roughly 762 MB. It looks to me like the serialized JSON string is becoming huge. If so, what is the best way to transfer data of this size?
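
A simplified sketch of the failing pattern (the JSON string here is built by hand purely for illustration; the actual serializer and endpoint don't change the outcome):

```scala
import org.zeromq.ZMQ

object JsonSender {
  def main(args: Array[String]): Unit = {
    val data = Array.fill(100000000)(1.0f)  // stand-in for the real array

    val ctx    = ZMQ.context(1)
    val socket = ctx.socket(ZMQ.PUSH)
    socket.connect("tcp://localhost:5555")  // hypothetical endpoint

    // 100 million floats rendered as text give a string of hundreds of millions
    // of characters; socket.send(String) then calls String.getBytes, which asks
    // for a byte[] larger than the JVM permits, hence the OutOfMemoryError.
    val json = data.mkString("[", ",", "]")
    socket.send(json)

    socket.close()
    ctx.term()
  }
}
```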

user1274878
  • Does your use-case somehow benefit, or **does it rather get penalised**, by such a vast amount of data transfer? Normally, latency-motivated & real-time system designs benefit from distributed processing and thus from transferring just a minimalistic decision-making product (protected for a fail-safe modus operandi) needed in the remote process. **Did you consider this much more efficient distributed-processing way?** – user3666197 Mar 29 '16 at 07:30
  • "best" is opinion-based, hence voting to close. If you want to fix this, define the precise criteria for "best". BTW: Have you searched the web for the error message? – Ulrich Eckhardt Mar 29 '16 at 08:05
  • @user3666197 : I am integrating my OpenCL (C code) with Apache Spark. A Spark client running on a node will receive a large amount of data, which will be sent to the C process and eventually transferred to a GPU(s). So, yes, I do need the large amount of data to be transferred between the two processes. – user1274878 Mar 29 '16 at 16:49
  • As explained below, your options would be much more promising if you manage to maintain a continuously "mirrored" or "shadowed" replica of the data, which gets incrementally built in a natural order/tempo with the stateful evolution of the underlying model (observations). Any attempt to move an ad-hoc copy (via any form of a BLOB) ex-post will principally suffer from the adverse impacts explained below. – user3666197 Mar 30 '16 at 06:57

3 Answers


As ZeroMQ's FAQ page suggests, you can use any data marshalling format that is supported in both Java (and therefore Scala) and C. There are a lot of those (for some, the C support is third-party, though the C++ support usually isn't): Protocol Buffers, MsgPack, Avro, Thrift, BSON, etc.
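
For illustration, even the simplest possible binary framing, raw little-endian IEEE-754 floats in a single ZeroMQ frame, already avoids the JSON blow-up; the C side can read the message body directly as a float array. A minimal sketch (the PUSH socket and the endpoint are assumptions, not part of the question):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.zeromq.ZMQ

object BinarySender {
  def main(args: Array[String]): Unit = {
    val data = Array.fill(1000000)(scala.util.Random.nextFloat())

    val ctx    = ZMQ.context(1)
    val socket = ctx.socket(ZMQ.PUSH)
    socket.bind("tcp://*:5555")  // hypothetical endpoint

    // Pack the floats as raw little-endian bytes, 4 per value; no text involved.
    val buf = ByteBuffer.allocate(data.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    buf.asFloatBuffer().put(data)

    socket.send(buf.array(), 0)  // one binary frame

    socket.close()
    ctx.term()
  }
}
```

Any of the listed libraries would replace the hand-rolled ByteBuffer step with a schema-driven encoder, which is worth it as soon as the payload is more structured than a flat array.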

Alexey Romanov

Size? No, a transport-philosophy related constraint matters.

There is a more important issue in ZeroMQ transport orchestration than the choice of an external data-serialiser SER/DES policy.

No one forbids you from trying to send as big a BLOB as possible, but while the JSON-decorated string has already shown you the dark side of such an approach, there are other reasons not to proceed this way.

ZeroMQ is without question a great and powerful toolbox. Still, it takes some time to gain the insight necessary for a smart and highly performant code deployment that makes the maximum out of this powerful work-horse.

One of the side-effects of the feature-rich internal ecosystem "under the hood" is a little-known policy hidden in the message-delivery concept.

One may send any reasonably sized message, yet delivery is not guaranteed. A message is either delivered completely or nothing gets out at all; as said above, nothing is guaranteed.

Ouch?!

Yes, not guaranteed.

Based on this core Zero-Guarantee philosophy, one should take due care in deciding on steps and measures, the more so if you plan to move "Gigabyte BEASTs" there and back.

In this very sense, it might become quantitatively supported by real SUT testing that, if you indeed still need to move GBs (ref. the comment above, under the OP) and have no other choice, segmenting the whole volume of data into small-sized messages and re-assembling them on the other side (an admittedly error-prone step) results in a much faster and much safer end-to-end solution than trying to use brute force and instruct the code to dump about a GB of data onto whatever resources are actually available (the Zero-Copy principle of ZeroMQ cannot and will not per se save you in these efforts). A sketch of such a segmented transport follows below.
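
A minimal sketch of such a segmented transport, assuming a plain PUSH-side socket and a hypothetical ~4 MB frame size (the real figures ought to come from SUT testing, not from this example):

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.zeromq.ZMQ

object ChunkedSender {
  // Hypothetical frame size: 1 M floats ~ 4 MB per frame; tune on the real SUT.
  val ChunkFloats = 1 << 20

  def sendChunked(socket: ZMQ.Socket, data: Array[Float]): Unit = {
    val chunks = data.grouped(ChunkFloats).toArray
    for ((chunk, i) <- chunks.zipWithIndex) {
      val buf = ByteBuffer.allocate(chunk.length * 4).order(ByteOrder.LITTLE_ENDIAN)
      buf.asFloatBuffer().put(chunk)
      // SNDMORE on every frame except the last: the parts travel as one
      // logical multipart message and arrive in order, or not at all.
      val flags = if (i < chunks.length - 1) ZMQ.SNDMORE else 0
      socket.send(buf.array(), flags)
    }
  }
}
```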

For details on another hidden trap, related to the not-fully-Zero-Copy implementation, read the remarks by Martin SUSTRIK, co-father of ZeroMQ, on Zero-Copy "till-kernel-boundary-only" (so at least double the memory-space allocations are to be expected...).


The best next step?

While it will not solve your trouble within a few SLOCs, the best thing, if you are serious about investing your intellectual powers into distributed processing, is to read Pieter HINTJENS' lovely book "Code Connected, Vol. 1".

Yes, it takes some time to generate one's own insight, but it will raise you in many aspects onto another level of professional code design. Worth the time. Worth the effort.

user3666197

First things first: there's nothing inherent to JSON or any other data serialization format that makes it non-viable for large data sets; you just have to make sure that your machine has the necessary resources to process it.

Certain formats might be more memory-efficient than others; most likely a binary format is going to suit you better.

However, depending on your circumstances (e.g. if you constantly need updated access to the entire dataset), user3666197's answer is probably better suited to your scenario.

Allow me to split the difference.

If your use case fits the following parameters:

  1. You need infrequent access to the entire data set
  2. You can deal with long latency times
  3. You cannot increase the resources available at the receiving host
  4. You cannot (or it is prohibitively difficult to) create a continuously updated local data store at the receiving host

... then your best bet is simply splitting the data set. See how large a message you can send and parse without running out of resources, give yourself a 20-50% buffer (depending on your tolerance), split your data set into chunks of that size, then send the chunks and reassemble them (a sketch follows below). This works under the assumption that the memory problem results from holding both the serialized and the deserialized data in memory at the same time during deserialization. If that's not true and the deserialized data set is itself too large to fit in memory, then you'll just have to process the data in chunks without reassembling them. If that's the case, I would strongly recommend finding some way to increase your memory resources, because you're living on the edge.
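
Here's a rough Scala sketch of that splitting step. The size numbers are hypothetical placeholders for whatever your own testing shows, and the small index/total header is just one possible convention for reassembly on the receiving side:

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.zeromq.ZMQ

object SplitAndSend {
  // Hypothetical numbers: suppose testing showed ~64 MB messages are safe,
  // then stay ~30% below that limit, per the buffer suggested above.
  val MaxMessageBytes = 64 * 1024 * 1024
  val SafeBytes       = (MaxMessageBytes * 0.7).toInt
  val FloatsPerChunk  = SafeBytes / 4

  def send(socket: ZMQ.Socket, data: Array[Float]): Unit = {
    val totalChunks = (data.length + FloatsPerChunk - 1) / FloatsPerChunk
    for ((chunk, index) <- data.grouped(FloatsPerChunk).zipWithIndex) {
      // 8-byte header (chunk index, total chunks) followed by the raw floats,
      // so the receiver can put the pieces back together in order.
      val buf = ByteBuffer
        .allocate(8 + chunk.length * 4)
        .order(ByteOrder.LITTLE_ENDIAN)
      buf.putInt(index)
      buf.putInt(totalChunks)
      buf.asFloatBuffer().put(chunk)
      socket.send(buf.array(), 0)
    }
  }
}
```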

Jason