I have a dataset on which I need to run groupby operations over different columns. Here is a minimal working example using polars version "0.21.1":

use polars::prelude::*;
use polars_lazy::prelude::*;
use polars::df;

fn main(){
  let df = df![
    "x1" => ["a", "b", "c", "a"],
    "x2" => ["A", "A", "B", "B"],
    "y" => [1, 2, 3, 4],
    ].unwrap();

  let lf: LazyFrame = df.lazy();

  let out1 = groupby_x1(&lf);
  println!("{:?}", out1.collect());
  let out2 = groupby_x2(&lf);
  println!("{:?}", out2.collect());

}

fn groupby_x1(lf: &LazyFrame) -> LazyFrame {
  let lf1: LazyFrame = lf.clone().groupby([col("x1")]).agg([
    col("y").sum().alias("y_sum"),
  ]);
  lf1
}

fn groupby_x2(lf: &LazyFrame) -> LazyFrame {
  let lf1: LazyFrame = lf.clone().groupby([col("x2")]).agg([
    col("y").sum().alias("y_sum"),
  ]);
  lf1
}

But in this code I am making deep copies of the whole LazyFrame lf (using lf.clone()). How can I avoid that? If I replace lf.clone() with lf in the functions groupby_x1 and groupby_x2, I get the following error:

error[E0507]: cannot move out of `*lf` which is behind a shared reference
  --> src/main.rs:22:24
   |
22 |   let lf1: LazyFrame = lf.groupby([col("x1")]).agg([
   |                        ^^^^^^^^^^^^^^^^^^^^^^^ move occurs because `*lf` has type `polars_lazy::frame::LazyFrame`, which does not implement the `Copy` trait

error[E0507]: cannot move out of `*lf` which is behind a shared reference
  --> src/main.rs:29:24
   |
29 |   let lf1: LazyFrame = lf.groupby([col("x2")]).agg([
   |                        ^^^^^^^^^^^^^^^^^^^^^^^ move occurs because `*lf` has type `polars_lazy::frame::LazyFrame`, which does not implement the `Copy` trait

For more information about this error, try `rustc --explain E0507`.
error: could not compile `polars_try` due to 2 previous errors
Kushdesh

2 Answers

Polars Series are a newtype around Arc<Vec<ArrowRef>>. When you clone a DataFrame, only the reference count of the Arc is incremented.

In other words, polars never does deep clones. Clones of a DataFrame are super cheap.
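
The reference-counting behavior described above can be sketched with the standard library alone. This is only an illustration: Column and Frame are hypothetical stand-ins, not polars types, and the sketch assumes nothing about polars internals beyond the "Arc around a buffer" idea stated in this answer.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a column; NOT the real polars Series type.
#[derive(Clone)]
struct Column {
    // The values live behind a reference-counted pointer.
    data: Arc<Vec<i64>>,
}

// Hypothetical stand-in for a DataFrame: just a list of columns.
#[derive(Clone)]
struct Frame {
    columns: Vec<Column>,
}

fn make_frame() -> Frame {
    Frame {
        columns: vec![Column { data: Arc::new(vec![1, 2, 3, 4]) }],
    }
}

// Number of live handles to the first column's buffer.
fn buffer_refs(frame: &Frame) -> usize {
    Arc::strong_count(&frame.columns[0].data)
}

// Clones the frame and reports (clone shares the same buffer, handle count
// while the clone is alive). The clone is dropped when the function returns.
fn clone_report(frame: &Frame) -> (bool, usize) {
    let copy = frame.clone();
    let shares = Arc::ptr_eq(&frame.columns[0].data, &copy.columns[0].data);
    (shares, buffer_refs(frame))
}
```

Calling clone_report(&make_frame()) shows that the clone points at the same allocation as the original and that only the handle count changed; the Vec itself is never copied.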

ritchie46

From the documentation, LazyFrame:

Lazy abstraction over an eager DataFrame. It really is an abstraction over a logical plan. The methods of this struct will incrementally modify a logical plan until output is requested (via collect)

This means there is no deep copy of the DataFrame: cloning a LazyFrame only clones the logical plan, and nothing is executed until you actually collect it.

Hence you have two options:

  1. Keep cloning the LazyFrame inside the function if you want to keep the original plan intact.
  2. Take ownership of the plan, as in groupby_x1(lf: LazyFrame), and let the caller decide whether to clone the original plan before passing it in.
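
The two options above can be sketched as follows. This is a minimal illustration with a hypothetical Plan type standing in for LazyFrame; it is not the polars API. The only assumption carried over from the question is that the builder method consumes self, which is why calling it through a shared reference without cloning produces E0507.

```rust
// Hypothetical stand-in for a lazy query plan; NOT the polars LazyFrame API.
#[derive(Clone, Debug, PartialEq)]
struct Plan {
    steps: Vec<String>,
}

impl Plan {
    fn scan(source: &str) -> Plan {
        Plan { steps: vec![format!("scan {}", source)] }
    }

    // Consumes `self` and returns the extended plan. Nothing is executed
    // here; the plan only records what should happen later.
    fn groupby(mut self, key: &str) -> Plan {
        self.steps.push(format!("groupby {}", key));
        self
    }
}

// Option 1: borrow the plan and clone it inside the function,
// so the caller's plan stays intact.
fn groupby_borrowed(plan: &Plan, key: &str) -> Plan {
    plan.clone().groupby(key)
}

// Option 2: take ownership; the caller clones beforehand only if
// the original plan is still needed afterwards.
fn groupby_owned(plan: Plan, key: &str) -> Plan {
    plan.groupby(key)
}
```

With option 2, a caller that needs the base plan twice writes groupby_owned(base.clone(), "x1") for the first call and groupby_owned(base, "x2") for the last one, so the final use moves the plan without an extra clone.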
Netwave