Below is a simple example of a groupby-agg operation where I want to return an array/vector of min and max values of each group as a single column in the result.
#[pyfunction]
fn test_fn(pydf: PyDataFrame, colnm: &str, by_cols: Vec<&str>) -> PyResult<PyDataFrame> {
    let df: DataFrame = pydf.into();
    let res = df
        .lazy()
        .groupby(by_cols)
        .agg([col(colnm).apply(
            |s| {
                let v: Vec<f64> = vec![s.min().unwrap(), s.max().unwrap()];
                Ok(Some(Series::new("s", v)))
            },
            GetOutput::default(),
        )])
        .collect()
        .map_err(PyPolarsErr::from)?;
    Ok(PyDataFrame(res))
}

#[pymodule]
fn test_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(test_fn, m)?)?;
    Ok(())
}
As you can see from the following code section, column a in the resulting dataframe contains a list of two elements (min and max).
import polars as pl
import test_module
df = pl.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "g1": [1, 1, 2, 2, 2], "g2": [1, 1, 1, 2, 2]}
)
>>> test_module.test_fn(df, "a", ["g1", "g2"])
shape: (3, 3)
┌─────┬─────┬────────────┐
│ g1  ┆ g2  ┆ a          │
│ --- ┆ --- ┆ ---        │
│ i64 ┆ i64 ┆ list[f64]  │
╞═════╪═════╪════════════╡
│ 1   ┆ 1   ┆ [1.0, 2.0] │
│ 2   ┆ 2   ┆ [4.0, 5.0] │
│ 2   ┆ 1   ┆ [3.0, 3.0] │
└─────┴─────┴────────────┘
Now, how can I modify my test_fn above to make it return a struct/dict/hashmap instead of a vector, so that the result has named fields?
More specifically, what I want is:
>>> test_module.test_fn(df, "a", ["g1", "g2"])
shape: (3, 3)
┌─────┬─────┬───────────┐
│ g1  ┆ g2  ┆ a         │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ struct[2] │
╞═════╪═════╪═══════════╡
│ 1   ┆ 1   ┆ {1.0,2.0} │
│ 2   ┆ 2   ┆ {4.0,5.0} │
│ 2   ┆ 1   ┆ {3.0,3.0} │
└─────┴─────┴───────────┘
Or
>>> test_module.test_fn(df, "a", ["g1", "g2"])
shape: (3, 4)
┌─────┬─────┬───────┬───────┐
│ g1  ┆ g2  ┆ a_min ┆ a_max │
│ --- ┆ --- ┆ ---   ┆ ---   │
│ i64 ┆ i64 ┆ f64   ┆ f64   │
╞═════╪═════╪═══════╪═══════╡
│ 2   ┆ 1   ┆ 3.0   ┆ 3.0   │
│ 2   ┆ 2   ┆ 4.0   ┆ 5.0   │
│ 1   ┆ 1   ┆ 1.0   ┆ 2.0   │
└─────┴─────┴───────┴───────┘
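
To pin down the values I expect in each struct, independent of any polars API, here is a minimal pure-Python sketch of the per-group computation. The helper name group_min_max and the field names "min" and "max" are hypothetical and chosen only for illustration. This is not a polars solution, just a description of what each group's named fields should hold.

```python
from collections import defaultdict

# Same data as the example DataFrame above, stored column-wise.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g1": [1, 1, 2, 2, 2],
    "g2": [1, 1, 1, 2, 2],
}


def group_min_max(columns, value_col, by_cols):
    """Group rows by `by_cols` and return {group_key: {"min": ..., "max": ...}}.

    Hypothetical helper: it mirrors the desired struct column, where each
    group carries named fields instead of a positional two-element list.
    """
    groups = defaultdict(list)
    n_rows = len(columns[value_col])
    for i in range(n_rows):
        # The group key is the tuple of values in the grouping columns.
        key = tuple(columns[c][i] for c in by_cols)
        groups[key].append(columns[value_col][i])
    return {key: {"min": min(vals), "max": max(vals)} for key, vals in groups.items()}


result = group_min_max(data, "a", ["g1", "g2"])
for key, fields in result.items():
    print(key, fields)
```

Each key here corresponds to one (g1, g2) row of the desired output, and each inner dict corresponds to either the struct in column a or the separate a_min / a_max columns.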