Selecting rows by id and calculating the mean value in Polars with Rust

Question

I have the following dataframe and a vector of names.

name	age
panda	5
polarbear	7
seahorse	1

I would like to select rows by the names in the vector and calculate the average age of selected rows. I have the following code:

let names = vec!["panda", "seahorse"];
let avg = df.lazy()
    .select([col("name").filter(|c| names.contains(c))])
    .agg([col("age").mean()]);

Intuition says, pass a function to the filter (like I have done), however this is wrong. Apparently there is some sort of Expr API in play. How does it work? I find the docs a bit puzzling.

Does this help? https://pola-rs.github.io/polars-book/user-guide/dsl/expressions.html#filter-and-conditionals — isaactfa, Nov 11 '22 at 16:34
They present an API where you can select columns using regular expressions but I have an arbitrary list of names instead. There must be some simple way of doing it I am missing. — xosxos, Nov 11 '22 at 17:08

score 1 · Answer 1 · answered Nov 11 '22 at 18:47

I'm not sure there's a straightforward way to filter the way you're wanting to. One option you do have is to create a separate DataFrame containing the names you want to filter on and perform a join:

let filter_df = df! [
    "name" => ["panda", "seahorse"]
].unwrap();

let avg = df.lazy()
    .join(filter_df.lazy(), [col("name")], [col("name")], JoinType::Inner)
    .select([col("age").mean()]);

It was unclear whether you wanted to perform the mean on all the ages returned from this filtering (this is how it was worded in your question), if that's the case then this code should work for you. This is the output:

Ok(shape: (1, 1)
┌─────┐
│ age │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
└─────┘)

Your code makes it seem like you were trying to calculate the mean for all pandas vs the mean for all seahorses. If that's the case, then you can use a .groupby().agg() chain:

let avg = df.lazy()
    .join(names_df.lazy(), [col("name")], [col("name")], JoinType::Inner)
    .groupby([col("name")])
    .agg([col("age").mean()]);

Which produces:

Ok(shape: (2, 2)
┌──────────┬─────┐
│ name     ┆ age │
│ ---      ┆ --- │
│ str      ┆ f64 │
╞══════════╪═════╡
│ seahorse ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ panda    ┆ 5.0 │
└──────────┴─────┘)

Thank you, it works! I think the is_in and creating a Series and then an Expr by BallpointBen is a bit more fitting for my use case, but you also got my slightly stupid example correct and helped with the avg from all the selected names and not groups! — xosxos, Nov 12 '22 at 08:11

BallpointBen · Accepted Answer · 2022-11-11T22:55:03.487

If you use feature is_in, you can filter directly on a column being present in another Expr, which leads to the following.

fn main() -> Result<(), PolarsError> {
    // Added another panda to see the effect of groupby-mean
    let df = df! [
        "name" => ["panda", "panda", "polarbear", "seahorse"],
        "age" => [5, 55, 7, 1],
    ]?;

    let names = vec!["panda", "seahorse"];
    let df = df
        .lazy()
        .filter(col("name").is_in(lit(Series::from_iter(names))))
        .groupby([col("name")])
        .agg([col("age").mean()]);
    println!("{:?}", df.collect()?);

    Ok(())
}

┌──────────┬──────┐
│ name     ┆ age  │
│ ---      ┆ ---  │
│ str      ┆ f64  │
╞══════════╪══════╡
│ panda    ┆ 30.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ seahorse ┆ 1.0  │
└──────────┴──────┘

Thank you, I got it working! Indeed for people wondering, the is_in feature must be added to Cargo.toml. `polars = { version = "*", features = ["lazy", "is_in"] }` — xosxos, Nov 12 '22 at 10:22

Selecting rows by id and calculating the mean value in Polars with Rust

2 Answers2