I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it specifies the in-memory layout and format of data, so that multiple tasks can read that description like a treasure map and all find their way to the same data (without copying). I think I can see how this works within Python/Pandas in a single process; it's pretty easy to create an Arrow array, pass it around to different objects, and observe the whole "zero-copy" thing in action.
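To be concrete about the single-process behaviour I mean, here is a minimal sketch that uses only the Python standard library. The `array`/`memoryview` pair is a stand-in for an Arrow buffer (an analogy I'm assuming for illustration, not pyarrow itself): several views share one block of memory, and only an explicit conversion makes a copy.

```python
from array import array

# Stand-in for an Arrow buffer: a contiguous block of 64-bit integers.
data = array("q", [1, 2, 3, 4, 5])

# Two "consumers" take views over the same memory; nothing is copied.
view_a = memoryview(data)
view_b = view_a[1:4]  # a slice is just an offset and a length

# A write through one view is visible through the other and in the owner.
view_a[2] = 30
shared = view_b[1] == 30 and data[2] == 30

# Contrast: tolist() materialises new Python objects, i.e. a real copy.
copied = view_b.tolist()
view_a[2] = 300
copy_is_stale = copied[1] == 30  # the copy did not follow the later write
```

Here `shared` and `copy_is_stale` both end up true, which is the distinction between a view and a copy that I can observe easily inside one process.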
However, when we talk about cross-system communication without overhead, I am almost entirely lost. For example, how does PySpark convert Java objects to an Arrow format, then pass that to Python/Pandas? I've attempted to look at the code here, but to a non-Java/Scala guy it just looks like it's converting Spark rows to Arrow objects, then to byteArrays (line 124), which doesn't seem like zero copy, zero overhead to me.
Likewise, if I wanted to pass an Arrow array from Python/pyarrow to, say, Rust (using Rust's Arrow API), I can't wrap my mind around how I'd do that, particularly considering that this approach to calling a Rust function from Python doesn't seem to work with Arrow primitives. Is there a way to point both Rust and Python to the same memory address(es)? Do I have to send Arrow data as a byteArray somehow?
// lib.rs
#[macro_use]
extern crate cpython;

use cpython::{PyResult, Python};
use arrow::array::Int64Array;
use arrow::compute::array_ops::sum;

fn sum_col(_py: Python, val: Int64Array) -> PyResult<i64> {
    let total = sum(val).unwrap();
    Ok(total)
}

py_module_initializer!(rust_arrow_2, initrust_arrow_2, Pyinit_rust_arrow_2, |py, m| {
    m.add(py, "__doc__", "This module is implemented in Rust.")?;
    m.add(py, "sum_col", py_fn!(py, sum_col(val: Int64Array)))?;
    Ok(())
});
$ cargo build --release
...
error[E0277]: the trait bound `arrow::array::array::PrimitiveArray<arrow::datatypes::Int64Type>: cpython::FromPyObject<'_>` is not satisfied
  --> src/lib.rs:15:26
   |
15 |     m.add(py, "sum_col", py_fn!(py, sum_col(val: Int64Array)))?;
   |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `cpython::FromPyObject<'_>` is not implemented for `arrow::array::array::PrimitiveArray<arrow::datatypes::Int64Type>`
   |
   = note: required by `cpython::FromPyObject::extract`
   = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)
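
To make the "same memory address(es)" idea concrete, this is roughly what I imagine instead of sending bytes: hand the other side a raw address and a length, and let it read the values in place. The sketch below uses only the Python standard library, with `ctypes` playing the role of the foreign (Rust) side. That substitution, and the helper name `sum_at`, are assumptions for illustration only; no Arrow or Rust API is involved.

```python
import ctypes
from array import array

# Producer side: owns a contiguous buffer of int64 values.
values = array("q", [10, 20, 30])
address, length = values.buffer_info()  # (memory address, number of items)


def sum_at(addr: int, n: int) -> int:
    """Hypothetical 'foreign' consumer: it sees only an address and a length."""
    borrowed = (ctypes.c_int64 * n).from_address(addr)  # a view, not a copy
    return sum(borrowed)


total = sum_at(address, length)

# Both sides look at the same bytes: a write through the raw address
# shows up in the original array.
(ctypes.c_int64 * length).from_address(address)[0] = 11
same_memory = values[0] == 11
```

The producer has to keep `values` alive for as long as the consumer reads from the address, and both sides have to agree on the element type and layout. That handshake is the part I don't see how to express between pyarrow and Rust's Arrow API.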