I have a quite big `.arrow` file (150 GB) that I need to split into several parts based on some filters. I started with the Python Polars implementation for prototyping (the JIT compiler really helped here), and now I am trying to port my working Python implementation to Rust to hopefully speed up the process in production.
My code looks like this (my real code has more complex filters, but I have simplified the problem here):
```rust
use polars::prelude::*;
use polars_io::ipc::IpcReader;

fn main() -> PolarsResult<()> {
    // Open the .arrow file and parse it as a LazyFrame
    let file = std::fs::File::open("./myData.arrow").expect("file not found");
    let df = IpcReader::new(file).finish()?.lazy();

    // Split the LazyFrame into various parts
    let df1 = df.filter(col("Volume").gt_eq(0).and(col("Volume").lt(1000)));
    let df2 = df.filter(col("Volume").gt_eq(1000).and(col("Volume").lt(2000)));
    // ... many more filters
    // let df300 = df.filter(col("Volume").gt_eq(n).and(col("Volume").lt(n + 1)));

    // Printing the LazyFrames here to simplify the issue; in the real world
    // I would create a new .arrow file from each filter
    println!("{}", df1.collect()?);
    println!("{}", df2.collect()?);
    // ...
    // println!("{}", df300.collect()?);
    Ok(())
}
```
Obviously, this code won't compile, because the borrow checker wants me to `.clone()` my `df` `LazyFrame`. So here is the working code after adding `.clone()`:
```rust
use polars::prelude::*;
use polars_io::ipc::IpcReader;

fn main() -> PolarsResult<()> {
    // Open the .arrow file and parse it as a LazyFrame
    let file = std::fs::File::open("./myData.arrow").expect("file not found");
    let df = IpcReader::new(file).finish()?.lazy();

    // Split the LazyFrame into various parts
    let df1 = df.clone().filter(col("Volume").gt_eq(0).and(col("Volume").lt(1000)));
    let df2 = df.clone().filter(col("Volume").gt_eq(1000).and(col("Volume").lt(2000)));
    // ... many more filters
    // let df300 = df.clone().filter(col("Volume").gt_eq(n).and(col("Volume").lt(n + 1)));

    // Printing the LazyFrames here to simplify the issue; in the real world
    // I would create a new .arrow file from each filter
    println!("{}", df1.collect()?);
    println!("{}", df2.collect()?);
    // ...
    // println!("{}", df300.collect()?);
    Ok(())
}
```
Now the compiler does not complain, so I can build it using:

```shell
cargo build --release
```
The problem is that I see at least a 60x performance hit compared to the same implementation in Python, which confused me a bit. I understand that the multiple `.clone()` calls cause the slower performance; however, my initial understanding was that Polars would optimize such `.clone()` calls in the context of a `LazyFrame` and avoid unnecessary copies of the data. Or am I wrong?
Candidly, I tried to just filter a reference to `df` using `&df`, but the borrow checker complained about that too.
I did manage to use references instead of copies by using a `DataFrame` instead of a `LazyFrame`, but then I lose a lot of the methods I am used to from Python (for example, `with_columns` is not available on `DataFrame` in Rust, filters need masks instead of expressions, etc.).
Am I missing something obvious? What would be the correct way to reuse my original `df` `LazyFrame` instead of `.clone()`-ing it all over, preferably without using `DataFrame`s directly? :)