I have a dataset of time series data similar to the following:

    use chrono::{Duration, NaiveDate};
    use polars::prelude::*;

    // Three f64 columns holding 0..4, 4..8 and 8..12.
    let series_one = Series::new(
        "a",
        (0..4).map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_two = Series::new(
        "b",
        (4..8).map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_three = Series::new(
        "c",
        (8..12).map(|v| v as f64).collect::<Vec<_>>(),
    );

    let series_dates = Series::new(
        "date",
        // One date per day, starting at NaiveDate::default() (1970-01-01).
        (0..4)
            .map(|v| NaiveDate::default() + Duration::days(v))
            .collect::<Vec<_>>(),
    );
    let df = DataFrame::new(vec![series_one, series_two, series_three, series_dates]).unwrap();

This produces the following dataframe:

shape: (4, 4)
┌─────┬─────┬──────┬────────────┐
│ a   ┆ b   ┆ c    ┆ date       │
│ --- ┆ --- ┆ ---  ┆ ---        │
│ f64 ┆ f64 ┆ f64  ┆ date       │
╞═════╪═════╪══════╪════════════╡
│ 0.0 ┆ 4.0 ┆ 8.0  ┆ 1970-01-01 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 5.0 ┆ 9.0  ┆ 1970-01-02 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 6.0 ┆ 10.0 ┆ 1970-01-03 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 7.0 ┆ 11.0 ┆ 1970-01-04 │
└─────┴─────┴──────┴────────────┘

For every row in the dataframe, I would like to apply a function to the slice of the dataframe that contains that row and all rows before it.
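To make the intended "all rows up to this one" slicing concrete, here is a minimal sketch in plain Rust. It uses an ordinary slice as a stand-in for one dataframe column, and `expanding_windows` is a hypothetical helper name, not part of polars.

```rust
/// Yields every prefix of `rows`: the slice ending at row 0, then at row 1, and so on.
/// Each prefix borrows from `rows`, so no data is copied.
/// (Hypothetical helper for illustration; not a polars API.)
fn expanding_windows<'a, T>(rows: &'a [T]) -> impl Iterator<Item = &'a [T]> + 'a {
    (1..=rows.len()).map(move |end| &rows[..end])
}

fn main() {
    // Same values as column "a" in the example dataframe.
    let a = [0.0_f64, 1.0, 2.0, 3.0];
    for window in expanding_windows(&a) {
        println!("{:?}", window);
    }
}
```

Each window is what the function would receive for the corresponding row.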

Suppose I have some function `some_fn`:

    fn some_fn(_df: DataFrame) -> DataFrame {
        // Perform some read-only operation on the dataframe slice and
        // return a new dataframe with the results.
        DataFrame::new(vec![
            Series::new("a_result", vec![1.0, 2.0, 3.0, 4.0]),
            Series::new("b_result", vec![5.0, 6.0, 7.0, 8.0]),
            Series::new("c_result", vec![9.0, 10.0, 11.0, 12.0]),
        ])
        .unwrap()
    }

and I attempt to do the following:

    // For each row i, pass the first i + 1 rows to some_fn, then stack the results.
    let results = (0..df.height())
        .map(|i| some_fn(df.head(Some(i + 1))))
        .reduce(|acc, b| acc.vstack(&b).unwrap())
        .unwrap();

I find that this is exceedingly slow, taking about 1ms to process just 3000 rows. This was measured with an empty `some_fn`, so the time is spent on slicing rather than on any heavy computation. What is the right way to take full advantage of polars and do this processing efficiently?
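One general way to avoid re-reading every prefix applies only when the per-row computation can be updated from the previous row's state, which is an assumption about `some_fn` and not something stated above. The sketch below uses plain Rust vectors, not polars, and compares an expanding mean computed from scratch for each row with one that carries a running sum. Both function names are hypothetical.

```rust
/// Expanding mean computed naively: re-reads the whole prefix for every row (quadratic work).
fn expanding_mean_naive(values: &[f64]) -> Vec<f64> {
    (1..=values.len())
        .map(|end| values[..end].iter().sum::<f64>() / end as f64)
        .collect()
}

/// Same result, carrying a running sum from row to row (linear work).
fn expanding_mean_incremental(values: &[f64]) -> Vec<f64> {
    let mut sum = 0.0;
    values
        .iter()
        .enumerate()
        .map(|(i, v)| {
            sum += v;
            sum / (i + 1) as f64
        })
        .collect()
}

fn main() {
    // Same values as column "a" in the example dataframe.
    let a = [0.0_f64, 1.0, 2.0, 3.0];
    println!("{:?}", expanding_mean_naive(&a));
    println!("{:?}", expanding_mean_incremental(&a));
}
```

Whether this helps depends on what `some_fn` actually computes; a function that genuinely needs the full prefix each time cannot be rewritten this way.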

BallpointBen
ChosunOne
  • Have you tried using LazyFrames rather than DataFrames? DataFrame's eager APIs could cause unnecessary allocations, as they copy data with every operation. Using LazyFrames to build the logical plan before performing the operation to extract the final dataframe will be more performant. You can use `lazy().limit(...)` as a replacement for the `.head()` call you're currently using. – emagers Dec 06 '22 at 22:31
  • It sounds like you want to convert to a LazyFrame, map one of the columns, and when you have the series in the map function, successively compute `some_fn` over slices of it to produce a new series. – BallpointBen Dec 07 '22 at 03:37
  • @emagers I'm completely open to using LazyFrames, since conversion is trivial. Perhaps you would like to expand your comment into an answer? – ChosunOne Dec 07 '22 at 06:57
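The "compute over successive slices of one column to produce a new series" idea from the comments can be sketched without polars as follows. This is only an illustration on plain slices: `map_expanding` is a hypothetical helper, and the closure passed to it stands in for `some_fn`.

```rust
/// Applies `f` to every prefix of `column` and collects one output value per row.
/// (Hypothetical helper for illustration; not a polars API.)
fn map_expanding<F>(column: &[f64], f: F) -> Vec<f64>
where
    F: Fn(&[f64]) -> f64,
{
    // Allocate the output once instead of growing it row by row.
    let mut out = Vec::with_capacity(column.len());
    for end in 1..=column.len() {
        out.push(f(&column[..end]));
    }
    out
}

fn main() {
    // Same values as column "b" in the example dataframe.
    let b = [4.0_f64, 5.0, 6.0, 7.0];
    // Stand-in for `some_fn`: the largest value seen so far.
    let running_max = map_expanding(&b, |prefix| {
        prefix.iter().copied().fold(f64::NEG_INFINITY, f64::max)
    });
    println!("{:?}", running_max);
}
```

The prefixes are borrowed slices, so the only allocation here is the output vector.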

0 Answers