How do I find the set intersection of a column of lists?
[dependencies]
polars = { version = "*", features = ["lazy"] }
use polars::df;
use polars::prelude::*;

fn main() {
    let df = df![
        "bar" => ["a", "b", "c", "a", "b", "c", "a", "c"],
        "ham" => ["foo", "foo", "foo", "bar", "bar", "bar", "bing", "bang"]
    ]
    .unwrap();

    let df_grp = df
        .lazy()
        .groupby(["bar"])
        .agg([col("ham").list()])
        .collect()
        .unwrap();

    println!("{:?}", df_grp);
}
prints:
┌─────┬────────────────────────┐
│ bar ┆ ham                    │
│ --- ┆ ---                    │
│ str ┆ list[str]              │
╞═════╪════════════════════════╡
│ c   ┆ ["foo", "bar", "bang"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ ["foo", "bar"]         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a   ┆ ["foo", "bar", "bing"] │
└─────┴────────────────────────┘
What I would like to do is take the set intersection of the lists in rows a/b/c ⇒ ["foo", "bar"], i.e. the strings common to all rows.
My thought was to turn the column of lists of strings into a column of hash sets and then fold/reduce with intersection. How do I go from Series<list<String>> ⇒ Series<HashSet>? If this is possible in a LazyFrame fold expression, that would be great, but how would I define the accumulator? lit(HashSet)?
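
To make the idea concrete, here is a minimal sketch of the fold I have in mind, written in plain Rust with std's HashSet and no Polars at all. The function name intersect_all and the use of Vec<&str> as a stand-in for each list row are assumptions for illustration only. Since intersection has no obvious neutral starting value, the sketch seeds the accumulator with the first set, which is exactly the part I do not know how to express in a fold expression.

```rust
use std::collections::HashSet;

/// Intersect all lists, treating each one as a set of strings.
/// Returns an empty set when there are no lists at all.
/// (Hypothetical helper for illustration; not part of the Polars API.)
fn intersect_all(lists: &[Vec<&str>]) -> HashSet<String> {
    // Convert every list into a HashSet<String>, lazily.
    let mut sets = lists
        .iter()
        .map(|list| list.iter().map(|s| s.to_string()).collect::<HashSet<String>>());

    // Intersection has no neutral element to start from,
    // so the first set becomes the initial accumulator.
    let first = match sets.next() {
        Some(set) => set,
        None => return HashSet::new(),
    };

    // Fold the remaining sets into the accumulator, keeping only shared items.
    sets.fold(first, |acc, set| acc.intersection(&set).cloned().collect())
}
```

Calling intersect_all with the three lists from the grouped output above should give a set containing only "foo" and "bar". What I am after is the equivalent of this inside the Polars lazy API.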