
I'm using Py4J to send a byte array (Array[Byte]) from Scala to Python. On the Python side I want to create a NumPy array (preferably immutable) that is just a view of these bytes, interpreted as np.complex128. Disregarding byte order, the bytes are laid out as follows: real1, imag1, real2, imag2, ....

According to the Py4J documentation (I'm on Python 3.5 and Py4J 0.10.3), I should get a bytes object on the Python side, but I'm actually getting a JavaArray. As I understand it, a JavaArray holds a reference back to the array on the JVM side, which I think is what makes this pretty slow. I'm guessing this is due to Scala's "autoboxing" of byte to Byte (the class), but I'm not sure.

Py4j question: Is it possible to force py4j to return a copy of the bytes?

Scala question: Maybe my guess is wrong and it actually compiles down to a primitive byte array in this case? If not, is there any way to make sure it does, other than writing that part in Java instead?

John Pertoft
  • So I tested to write that part in java, and it works as expected which confirmed my suspicion, but my questions remain. – John Pertoft Oct 06 '16 at 10:24
  • Actually, I'm stupid. The problem was not with Scala but rather that I was sending what was effectively a byte[][] which then turned into a JavaArray where each element was a python bytes object. Still not sure exactly where things are kept in this case because doing something with each element of that JavaArray is still slow. – John Pertoft Oct 06 '16 at 13:38

1 Answer


The only way to force Py4J to get a bytearray in Python is to make sure Java is sending a byte[].
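Once a real bytes object does arrive on the Python side, building the read-only complex128 view asked about in the question should be a single `np.frombuffer` call. The following is a minimal sketch, assuming interleaved float64 (real, imag) pairs in native byte order; the helper name `complex_view` is hypothetical, and the payload is built locally with `struct` as a stand-in for what Py4J would deliver.

```python
import struct

import numpy as np


def complex_view(payload, dtype=np.complex128):
    """Interpret interleaved float64 (real, imag) pairs as complex numbers.

    Hypothetical helper. np.frombuffer reuses the memory of `payload`
    instead of copying it, and because bytes is immutable the resulting
    array is read-only.
    """
    itemsize = np.dtype(dtype).itemsize  # 16 bytes for complex128
    if len(payload) % itemsize != 0:
        raise ValueError("payload length must be a multiple of %d" % itemsize)
    return np.frombuffer(payload, dtype=dtype)


# Stand-in for the bytes object received from the JVM (assumption:
# native byte order, two complex values: 1+2j and 3+4j).
payload = struct.pack("=4d", 1.0, 2.0, 3.0, 4.0)
view = complex_view(payload)
```

If the JVM side wrote the doubles in big-endian order (the default for `java.nio.ByteBuffer`), passing `dtype=">c16"` instead would be the corresponding adjustment.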

I'm currently working on a new binary protocol (0.11) that will make these types of transfers faster and make it easy to write adapters for these scenarios. There is no plan to natively support boxed primitive arrays, but you may want to look at spylon, a collection of utilities for working with Scala and Py4J.

Another possibility: the Spark team uses Py4J to interact with Scala but uses a secondary socket to transfer large byte arrays because this is currently not a fast operation with Py4J.
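The secondary-socket idea can be sketched on the Python side as a length-prefixed frame reader. This is only an illustration under stated assumptions, not Spark's actual implementation: the framing (a 4-byte big-endian length followed by the payload, which is what Java's `DataOutputStream.writeInt` would produce) and the helper names are hypothetical, and `socket.socketpair()` stands in for a real connection to the JVM.

```python
import socket
import struct


def send_frame(sock, payload):
    """Send a 4-byte big-endian length followed by the payload bytes."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before full frame arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock):
    """Read one length-prefixed frame and return its payload as bytes."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Local demonstration only: in practice the sending end would be the JVM.
left, right = socket.socketpair()
try:
    payload = bytes(range(32))
    send_frame(left, payload)
    received = recv_frame(right)
finally:
    left.close()
    right.close()
```

The whole array crosses the process boundary in one transfer and lands as an ordinary bytes object, rather than being fetched element by element through the gateway.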

Barthelemy
  • I actually realised that Scala wasn't the problem. What I'm trying to do is actually sending what is effectively a byte[][] which I guess then becomes a JavaArray since it's not a byte[]. Does this mean it is still just a reference to the bytes on the jvm side? Where are the bytes actually stored? – John Pertoft Oct 06 '16 at 13:42
  • Bytes stay on the JVM in that particular case and it will be relatively slow (one roundtrip per byte access) – Barthelemy Oct 06 '16 at 15:45