0

I have the following dataframe and a vector of names.

name age
panda 5
polarbear 7
seahorse 1

I would like to select rows by the names in the vector and calculate the average age of selected rows. I have the following code:

let names = vec!["panda", "seahorse"];
let avg = df.lazy()
    .select([col("name").filter(|c| names.contains(c))])
    .agg([col("age").mean()]);

Intuition says, pass a function to the filter (like I have done), however this is wrong. Apparently there is some sort of Expr API in play. How does it work? I find the docs a bit puzzling.

xosxos
  • 167
  • 1
  • 6
  • is this API doing Regular Expression validation? – easleyfixed Nov 11 '22 at 16:02
  • Does this help? https://pola-rs.github.io/polars-book/user-guide/dsl/expressions.html#filter-and-conditionals – isaactfa Nov 11 '22 at 16:34
  • They present an API where you can select columns using regular expressions but I have an arbitrary list of names instead. There must be some simple way of doing it I am missing. – xosxos Nov 11 '22 at 17:08

2 Answers2

1

I'm not sure there's a straightforward way to filter the way you're wanting to. One option you do have is to create a separate DataFrame containing the names you want to filter on and perform a join:

let filter_df = df! [
    "name" => ["panda", "seahorse"]
].unwrap();

let avg = df.lazy()
    .join(filter_df.lazy(), [col("name")], [col("name")], JoinType::Inner)
    .select([col("age").mean()]);

It was unclear whether you wanted to perform the mean on all the ages returned from this filtering (this is how it was worded in your question), if that's the case then this code should work for you. This is the output:

Ok(shape: (1, 1)
┌─────┐
│ age │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
└─────┘)

Your code makes it seem like you were trying to calculate the mean for all pandas vs the mean for all seahorses. If that's the case, then you can use a .groupby().agg() chain:

let avg = df.lazy()
    .join(names_df.lazy(), [col("name")], [col("name")], JoinType::Inner)
    .groupby([col("name")])
    .agg([col("age").mean()]);

Which produces:

Ok(shape: (2, 2)
┌──────────┬─────┐
│ name     ┆ age │
│ ---      ┆ --- │
│ str      ┆ f64 │
╞══════════╪═════╡
│ seahorse ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ panda    ┆ 5.0 │
└──────────┴─────┘)
emagers
  • 841
  • 7
  • 13
  • Thank you, it works! I think the is_in and creating a Series and then an Expr by BallpointBen is a bit more fitting for my use case, but you also got my slightly stupid example correct and helped with the avg from all the selected names and not groups! – xosxos Nov 12 '22 at 08:11
1

If you use feature is_in, you can filter directly on a column being present in another Expr, which leads to the following.

fn main() -> Result<(), PolarsError> {
    // Added another panda to see the effect of groupby-mean
    let df = df! [
        "name" => ["panda", "panda", "polarbear", "seahorse"],
        "age" => [5, 55, 7, 1],
    ]?;

    let names = vec!["panda", "seahorse"];
    let df = df
        .lazy()
        .filter(col("name").is_in(lit(Series::from_iter(names))))
        .groupby([col("name")])
        .agg([col("age").mean()]);
    println!("{:?}", df.collect()?);

    Ok(())
}
┌──────────┬──────┐
│ name     ┆ age  │
│ ---      ┆ ---  │
│ str      ┆ f64  │
╞══════════╪══════╡
│ panda    ┆ 30.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ seahorse ┆ 1.0  │
└──────────┴──────┘
BallpointBen
  • 9,406
  • 1
  • 32
  • 62
  • Thank you, I got it working! Indeed for people wondering, the is_in feature must be added to Cargo.toml. `polars = { version = "*", features = ["lazy", "is_in"] }` – xosxos Nov 12 '22 at 10:22