
I have a large Pandas DataFrame in Python that I would like to access in a Julia program (as a Julia DataFrames.DataFrame object). Since I would like to avoid writing to disk every time a DataFrame is sent from Python to Julia, it seems ideal to serialize the DataFrame to Apache Arrow/Feather format in an in-memory buffer and send that buffer over TCP from Python to Julia.

I have tried extensively but cannot figure out how to

  1. Write Apache Arrow/Feather files to memory (not storage)
  2. Send them over TCP from Python
  3. Access them from the TCP port in Julia

Thanks for your help.

Jack N
  • If this is a single machine I would not use TCP/IP. The best option would be to call Python from Julia, or Julia from Python, and use the power of PyCall to perform this integration (as @quinnj mentioned). Another good but trickier option would be to use an interprocess communication library: serialize the object as Arrow to memory and then share it with the other process. An easier but equally efficient option could be to configure a RAM drive: serialize from one process and read from the other. – Przemyslaw Szufel Jun 07 '22 at 22:46
  • I don't have any experience here, but isn't the [Plasma In-Memory Object Store](https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/) an option to consider? – g.a Mar 19 '23 at 23:12

1 Answer


Hmmm, good question. I'm not sure a TCP socket is necessarily the easiest route, since one end has to be the "server" socket and the other the client. The typical TCP flow is: 1) the server binds to a port and listens, 2) the server calls "accept" to wait for a new connection, 3) the client calls "connect" on that port to initiate the connection, 4) once the server accepts, the connection is established and the server and client can write data to each other over the connected socket.
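As a rough illustration of that flow, here is a minimal Python sketch, not a tested Python-to-Julia setup. All function names are hypothetical. `dataframe_to_arrow_bytes` assumes `pyarrow` is installed and writes the Arrow IPC stream into an in-memory buffer rather than a file. The 8-byte length header is a framing convention chosen for this example, not something Arrow or TCP prescribes. In `loopback_demo` a thread on localhost stands in for the receiving process.

```python
import socket
import struct
import threading


def dataframe_to_arrow_bytes(df):
    """Serialize a pandas DataFrame to Arrow IPC stream bytes, entirely in memory."""
    import pyarrow as pa  # lazy import; assumes pyarrow is installed

    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()  # in-memory sink, nothing touches disk
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def send_framed(sock, payload):
    """Send an 8-byte big-endian length header followed by the payload."""
    sock.sendall(struct.pack(">Q", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a single recv() may return fewer than requested."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_framed(sock):
    """Read the length header, then exactly that many payload bytes."""
    (length,) = struct.unpack(">Q", recv_exact(sock, 8))
    return recv_exact(sock, length)


def loopback_demo(payload):
    """Run steps 1-4 on localhost; a thread plays the receiving process."""
    received = []
    server = socket.create_server(("127.0.0.1", 0))  # 1) bind and listen on a free port
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()  # 2) accept a connection
        with conn:
            received.append(recv_framed(conn))  # 4) read over the connected socket

    worker = threading.Thread(target=serve)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:  # 3) connect
        send_framed(client, payload)
    worker.join()
    server.close()
    return received[0]
```

With a real DataFrame the call would be something like `loopback_demo(dataframe_to_arrow_bytes(df))`. A Julia receiver would presumably read the same header and then hand the bytes to Arrow.jl (for instance `Arrow.Table(bytes)` wrapped in `DataFrame(...)`), but that side is not shown here and is an assumption.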

I've had success doing something similar to what you've described by using mmapped files, though maybe you have a hard requirement not to touch disk at all. This works nicely because the Python and Julia processes just "share" the mmapped file.
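To make the mmap idea concrete, here is a minimal Python sketch using only the standard library. The function names and the `frame.arrow` file name are hypothetical, and `payload` stands in for Arrow-formatted bytes. Both ends are in Python here purely so the example is self-contained; in the real setup the reader would be the Julia process opening the same path.

```python
import mmap
import os
import tempfile


def write_via_mmap(path, payload):
    """Size the file to the payload, then copy the bytes in through a memory map."""
    if not payload:
        raise ValueError("cannot memory-map an empty file")
    with open(path, "wb") as f:
        f.truncate(len(payload))  # the mapping cannot grow the file, so size it first
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        mm[:] = payload
        mm.flush()  # push the written pages back to the underlying file


def read_via_mmap(path):
    """Map the file read-only and copy its contents out as bytes."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return mm[:]


def mmap_roundtrip(payload):
    """Write and read back through a temporary file (hypothetical helper for the demo)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frame.arrow")
        write_via_mmap(path, payload)
        return read_via_mmap(path)
```

If avoiding the disk matters, pointing `path` at a RAM-backed location (for example a tmpfs mount on Linux) keeps the bytes in memory while both processes still see an ordinary file.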

Another approach you could check out is what I set up to do "round trip" testing in the Arrow.jl Julia package: https://github.com/apache/arrow-julia/blob/main/test/pyarrow_roundtrip.jl. It's set up to use PyCall.jl from Julia to share the bytes between Python and Julia.

Hope that helps!

quinnj