
In Python Polars, I can run this lazy query in about 17 ms, and the eager version takes almost the same time. The data has about 100,000 rows.

data sample:

code  date        open    close   change_predict  factor  factor_cta
A     2010-01-04  4080.0  4057.0  False           16.0    1.0
B     2010-01-04  4067.0  4066.0  False           16.0    1.0
A     2010-01-05  4066.0  4154.0  False           17.0    1.0
B     2010-01-05  4165.0  4044.0  False           18.0    1.0
A     2010-01-08  4040.0  3981.0  False           17.0    1.0

# Python, lazy mode
xx = data.lazy().groupby('date').agg([
    pl.col("code"),
    pl.col("open"),
    pl.col("close"),
    pl.col("change_predict"),
    pl.col("code").is_in(pl.col("code").sort_by('factor').head(5).filter(pl.col("factor_cta")==1)).alias('buy'),
    pl.col("code").is_in(pl.col("code").sort_by('factor').tail(5).filter(pl.col("factor_cta")==0)).alias('sell')
    ]).sort('date').explode(pl.exclude('date'))
xx = xx.collect()

# 17.8 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Python, eager mode
x = data.groupby('date').agg([
    pl.col("code"),
    pl.col("open"),
    pl.col("close"),
    pl.col("change_predict"),
    pl.col("code").is_in(pl.col("code").sort_by('factor').head(5).filter(pl.col("factor_cta")==1)).alias('buy'),
    pl.col("code").is_in(pl.col("code").sort_by('factor').tail(5).filter(pl.col("factor_cta")==0)).alias('sell')
]).sort('date').explode(pl.exclude('date'))

# 17.7 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But in Rust Polars, why is the lazy query (built in release mode) slower?

//rust lazy mode
let mut sw = stopwatch::Stopwatch::new();
sw.restart();
let x = data
        .lazy()
        .groupby([col("date")])
        .agg([
            col("code"),
            col("open"),
            col("close"),
            col("change_predict"),
            col("code").is_in(col("code").sort_by([col("factor")],[false]).head(Some(5)).filter(col("factor_cta").eq(lit(1)))).alias("buy"),
            col("code").is_in(col("code").sort_by([col("factor")],[false]).tail(Some(5)).filter(col("factor_cta").eq(lit(0)))).alias("sell"),
        ])
        .unwrap()
        .sort("date", false)
        .explode([col("*").exclude(["date"])])
        .unwrap();
println!("Groupby Date Success {:#?}", sw.elapsed());

//Groupby Date Success 51.4484ms
shape: (102238, 7)

This is the printed output.
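
One thing worth checking when comparing the two numbers: the Python figures are means over 7 runs of 100 loops each, while the Rust figure comes from a single stopwatch reading. A minimal sketch of a comparable measurement, using only the standard library, might look like the following. The helper name `mean_time` and its parameters are hypothetical, not part of Polars or the original code.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `f` `warmup` times untimed, then `runs` times
/// timed, and returns the mean wall-clock duration of the timed calls.
fn mean_time<T>(warmup: u32, runs: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(runs > 0, "need at least one timed run");

    // Warm-up calls let caches and any lazily created state settle first.
    for _ in 0..warmup {
        black_box(f());
    }

    // black_box keeps the compiler from discarding the unused results.
    let start = Instant::now();
    for _ in 0..runs {
        black_box(f());
    }
    start.elapsed() / runs
}
```

Assuming the lazy query is held in a variable (say `query`) and is materialized with `collect()`, a call such as `mean_time(10, 100, || query.clone().collect())` would give a per-run mean that is closer in spirit to what `%timeit` reports.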

It also seems that the eager (non-lazy) `groupby().agg()` in Rust Polars can't take complex expressions the way Python can?

Hakase
  • Rust doc example: `df.groupby(["date"])?.agg(&[("temp", &["n_unique", "sum", "min"])])`, plus a few operations such as `count first last sum min max mean median` – Hakase Mar 01 '22 at 02:13
  • Which version of `polars-rust` are you using? Python polars has almost a weekly release so it has a more recent release than rust. The latest rust release is a month ago. – ritchie46 Mar 01 '22 at 07:07
  • polars version = "0.19.1" / rust=1.59 nightly/ win11 – Hakase Mar 01 '22 at 09:44
  • Did you compile in release mode? Also, the Python version is running C behind the scenes... – Netwave Mar 01 '22 at 10:32
  • it's release mode in rust – Hakase Mar 01 '22 at 10:45
  • I think we just made polars faster in the meantime. How does `polars==0.12.15` compare? That version was released around the same time. – ritchie46 Mar 01 '22 at 13:16
  • `polars==0.12.15` is a bit slower: 17.1 ms becomes 17.7 ms, but that is still faster than the Rust version. The data source is the same parquet file, and the timing covers only the groupby.agg step. I want to use the Rust version to get better performance, so this is strange. – Hakase Mar 01 '22 at 13:54
  • Did you set `mimalloc` as global allocator? – ritchie46 Mar 01 '22 at 16:01
  • You mean `#[global_allocator] static GLOBAL: MiMalloc = MiMalloc;`? I didn't do that. – Hakase Mar 02 '22 at 01:10
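
For readers unfamiliar with the attribute mentioned in the last two comments: `#[global_allocator]` registers a static whose type implements `std::alloc::GlobalAlloc` as the allocator for the whole program. The sketch below shows the mechanism using only the standard library, wrapping the default `System` allocator with a counter. `CountingAlloc`, `ALLOCATIONS`, and `allocation_count` are hypothetical names for illustration. With the `mimalloc` crate added as a dependency (an assumption, it is not part of the original code), the static would instead be the `MiMalloc` line quoted in the comment.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical wrapper: forwards every request to the system allocator
// and counts how many allocations were made.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Registers the wrapper as the program-wide allocator. Only one such
// static may exist per program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations served so far.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

Whether switching allocators closes the gap measured in the question is not established in this thread. The sketch only shows where such a change would go.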
