I have a quite big `.arrow` file (150 GB) that I need to split into several parts based on some filters. I started with the Python Polars implementation for prototyping (the JIT compiler really helped here), and now I am trying to port my working Python implementation to Rust to hopefully speed up the process in production.
My code looks like this (my real code has more complex filters, but I have simplified the problem here):
```rust
use polars::prelude::*;
use polars_io::ipc::IpcReader;

fn main() -> PolarsResult<()> {
    // Open the .arrow file and parse it as a LazyFrame
    let file = std::fs::File::open("./myData.arrow").expect("file not found");
    let df = IpcReader::new(file).finish()?.lazy();

    // Split the LazyFrame into various parts
    let df1 = df.filter(col("Volume").gt_eq(0).and(col("Volume").lt(1000)));
    let df2 = df.filter(col("Volume").gt_eq(1000).and(col("Volume").lt(2000)));
    // ... many more filters
    // let df300 = df.filter(col("Volume").gt_eq(n).and(col("Volume").lt(n + 1)));

    // Printing the LazyFrames here to simplify the issue; in the real world
    // I would create a new .arrow file from each filter
    println!("{}", df1.collect()?);
    println!("{}", df2.collect()?);
    // ...
    // println!("{}", df300.collect()?);
    Ok(())
}
```
Obviously, this code won't compile, because the borrow checker wants me to `.clone()` my `df` `LazyFrame`. So here is the working code after adding `.clone()`:
```rust
use polars::prelude::*;
use polars_io::ipc::IpcReader;

fn main() -> PolarsResult<()> {
    // Open the .arrow file and parse it as a LazyFrame
    let file = std::fs::File::open("./myData.arrow").expect("file not found");
    let df = IpcReader::new(file).finish()?.lazy();

    // Split the LazyFrame into various parts
    let df1 = df.clone().filter(col("Volume").gt_eq(0).and(col("Volume").lt(1000)));
    let df2 = df.clone().filter(col("Volume").gt_eq(1000).and(col("Volume").lt(2000)));
    // ... many more filters
    // let df300 = df.clone().filter(col("Volume").gt_eq(n).and(col("Volume").lt(n + 1)));

    // Printing the LazyFrames here to simplify the issue; in the real world
    // I would create a new .arrow file from each filter
    println!("{}", df1.collect()?);
    println!("{}", df2.collect()?);
    // ...
    // println!("{}", df300.collect()?);
    Ok(())
}
```
Now the compiler does not complain, so I can build it using:

```shell
cargo build --release
```
The problem is that I see at least a 60x performance hit compared to the same implementation in Python, which confused me a bit. I understand that the multiple `.clone()` calls cause the slower performance; however, my initial understanding was that Polars would optimize such `.clone()` calls in the context of a `LazyFrame` and avoid unnecessary copies of the data. Or am I wrong?
Candidly, I tried to just filter a reference to `df` using `&df`, but the borrow checker complained about that too.
I did manage to use references instead of copies by using a `DataFrame` instead of a `LazyFrame`, but then I lose a lot of the methods I am used to from Python (for example, `with_columns` is not available on `DataFrame` in Rust, filters need masks instead of expressions, etc.).
Am I missing something obvious? What would be the correct way to reuse my original `df` `LazyFrame` instead of `.clone()`-ing it all over, preferably without using `DataFrame`s directly? :)