Partition by column value in rust using Arrow/Datafusion/Polars (like python panda's groupby)?

Question

I am looking for an equivalent of the convenient python panda syntax:

#df is a pandas dataframe

for fruit, sub_df in df.groupby('fruits'):
    # Do some stuff with sub_df and fruit

It is basically a groupby, where each group can be accessed as a single dataframe alongside its label (the common value in the grouping column).

I had a look to data fusion but I can't reproduce this behavior without having to first select all the unique values and second execute one select par value which result to re-parsing the whole file multiple times. I had a look to the Polars crate which seamed promising but wan't able to reach my goal either.

How would you do this in similar/better performance as the python code? I am open to any syntax / library / approche that would allow me efficiently to partition the parquet file by values of a fixed column.

Here is a rust sample code using polar as an example of what kind of input I am dealing with:

    let s0 = Series::new("fruits", ["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"].as_ref());
    let s1 = Series::new("maturity", ["A", "B", "A", "C", "A", "D"].as_ref());
    let s1 = Series::new("N", [1, 2, 2, 4, 2, 8].as_ref());
    // create a new DataFrame
    let df = DataFrame::new(vec![s0, s1, s2]).unwrap();

    // I would like to loop on all fruits values, each time with a dataframe containing only the records with this fruit.

ritchie46 · Answer 1 · 2022-06-19T07:57:47.280

If you activate the partition_by feature, polars exposes

DataFrame::partition_by and DataFrame::partition_by_stable.

use polars::prelude::*;

fn main() -> Result<()> {
    let partitioned = df! {
        "fruits" => &["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"],
        "maturity" => &["A", "B", "A", "C", "A", "D"],
        "N" => &[1, 2, 2, 4, 2, 8]

    }?.partition_by(["fruits"])?;

    dbg!(partitioned);
    Ok(())
}

Running this prints:

[src/main.rs:17] partitioned = [
    shape: (4, 3)
    ┌────────┬──────────┬─────┐
    │ fruits ┆ maturity ┆ N   │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ i32 │
    ╞════════╪══════════╪═════╡
    │ Pear   ┆ A        ┆ 2   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ C        ┆ 4   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ A        ┆ 2   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Pear   ┆ D        ┆ 8   │
    └────────┴──────────┴─────┘,
    shape: (2, 3)
    ┌────────┬──────────┬─────┐
    │ fruits ┆ maturity ┆ N   │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ i32 │
    ╞════════╪══════════╪═════╡
    │ Apple  ┆ A        ┆ 1   │
    ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
    │ Apple  ┆ B        ┆ 2   │
    └────────┴──────────┴─────┘,
]

Hey @ritchie46, the behavior of get_groups has changed: `df.groupby(["groups"])?.get_groups().iter().next()` returns *GroupIndicator*. How do you go from *GroupIndicator* to actual sub_df? Btw I am happy to update your answer (if I can) — Anatoly Bugakov, Jun 18 '22 at 10:52
There is currently actually a `partition_by` method. That should be much easier and faster. — ritchie46, Jun 18 '22 at 11:30
Thanks alot for prompt responce! I couldn't find any mentioning of "partition_by" in the documentation? — Anatoly Bugakov, Jun 18 '22 at 13:10

loremdipso · Answer 2 · 2020-10-13T16:02:16.400

You can do something like:

    let columns = df.columns();
    if let Ok(grouped) = df.groupby("fruits") {
        let sub_df = grouped.select(columns).agg_list()?;
        dbg!(sub_df);
    }

And that'll technically leave you with a dataframe. The problem then is that the dataframe's columns are arrays with all the values for each fruit, which might not be what you want.

+---------+--------------------------------------+-----------------------------+-----------------+
| fruits  | fruits_agg_list                      | maturity_agg_list           | N_agg_list      |
| ---     | ---                                  | ---                         | ---             |
| str     | list [str]                           | list [str]                  | list [i32]      |
+=========+======================================+=============================+=================+
| "Apple" | "[\"Apple\", \"Apple\"]"             | "[\"A\", \"B\"]"            | "[1, 2]"        |
+---------+--------------------------------------+-----------------------------+-----------------+
| "Pear"  | "[\"Pear\", \"Pear\", ... \"Pear\"]" | "[\"A\", \"C\", ... \"D\"]" | "[2, 4, ... 8]" |
+---------+--------------------------------------+-----------------------------+-----------------+

Thanks for your answer. I ran your code but I can't figure out: 1) why you group on each column ? 2) how I am suppose to retrieve one dataframe per value from the grouping column. I updated the question with a rust sample to make it a bit clearer. — Jeremy Cochoy, Oct 13 '20 at 07:24
Ah, I misunderstood. I edited my answer, but I don't know if it helps. — loremdipso, Oct 13 '20 at 16:02

Partition by column value in rust using Arrow/Datafusion/Polars (like python panda's groupby)?

2 Answers2