I am trying to recreate a Jaccard similarity function from Python in Rust with Polars. The function compares two lists and returns a float.
Python method
def jacc_sim(x, y):
    i_c = len(set.intersection(*[set(x), set(y)]))
    u_c = len(set.union(*[set(x), set(y)]))
    return i_c / float(u_c)
I compare the words in a list against the strings in a DataFrame:
import polars as pl
my_list = ["a", "ab", "abc", "abcd"]
df = pl.DataFrame({"keys": ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"]})
Change the strings in the df to lists:
out = df.with_columns(pl.col("keys").apply(lambda x: x.split(" ")).alias("c"))
Apply a lambda with the jacc_sim function:
res = out.select([pl.col("keys"), pl.col("c").apply(lambda x: jacc_sim(x, my_list))
.alias("d")]).sort("d", reverse=True)
Which gives the desired result in column "d":
shape: (4, 2)
┌────────────────────┬──────────┐
│ keys               ┆ d        │
│ ---                ┆ ---      │
│ str                ┆ f64      │
╞════════════════════╪══════════╡
│ a ab abc           ┆ 0.75     │
│ a ab               ┆ 0.5      │
│ b ba abc abcd      ┆ 0.333333 │
│ b ba bbc abcd bbcd ┆ 0.125    │
└────────────────────┴──────────┘
Rust method
This is more complicated. At first I couldn't find a way to work with lists, but then I came across HashSet:
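use std::collections::HashSet;
use std::hash::Hash;

// Jaccard similarity: size of the intersection divided by size of the union.
fn jaccard<T>(s1: HashSet<T>, s2: HashSet<T>) -> f32 where T: Hash + Eq {
    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    i / u
}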
And a function to create a HashSet from a vector:
fn vec_to_set(vec: &Vec<String>) -> HashSet<String> {
    vec.iter().cloned().collect()
}
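As a sanity check outside Polars, the two helpers above can be sketched with the standard library alone. This is a minimal sketch, not a Polars solution: the names `jaccard_ref` and `words_to_set` are hypothetical, the sets are borrowed instead of moved, and returning 0.0 for two empty sets is my own assumption (the Python version would raise a division error there).

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Jaccard similarity of two sets, taken by reference so the caller keeps them.
/// Assumption: two empty sets give 0.0 instead of dividing by zero.
fn jaccard_ref<T: Hash + Eq>(s1: &HashSet<T>, s2: &HashSet<T>) -> f64 {
    let union = s1.union(s2).count();
    if union == 0 {
        return 0.0;
    }
    s1.intersection(s2).count() as f64 / union as f64
}

/// Splits a space-separated key string into a set of borrowed words.
fn words_to_set(keys: &str) -> HashSet<&str> {
    keys.split(' ').collect()
}
```

Calling `jaccard_ref(&words_to_set("a ab abc"), &list)`, where `list` is a `HashSet<&str>` holding "a", "ab", "abc" and "abcd", gives 0.75, the same value as the first row of column "d" in the Python output.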
So the first bit will be:
let words = vec!["a", "ab", "abc", "abcd"];
let df = df![
    "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
]?;
let out = df.lazy().with_column(col("keys").str().split(" "));
UPDATE:
Thanks for the general comments, which I frankly don't understand. I have combined my attempt with Ritchie's pyo3 example to get a function that can be used as a closure on the DataFrame.
How do I create HashSets of the same type? s1 is HashSet<Option<&str>> and s2 is HashSet<String, {unknown}, {unknown}>.
And would the result produce a Series?
fn jaccard_similarity(sa: &Series, sb: Vec<String>) -> Series {
    let s1 = sa.utf8()
        .unwrap()
        .into_iter()
        .collect::<HashSet<_>>();
    let s2 = HashSet::from_iter(sb);
    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;
    let s = result.collect::<Float32Chunked>().into_series();
}
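On the type mismatch specifically, one possible direction is to bring both sides down to `HashSet<&str>`. The sketch below uses only the standard library. The `row` slice is a stand-in for the `Option<&str>` items that iterating the string column yields, and `jaccard_opt` is a hypothetical name. It assumes that nulls can simply be dropped and that two empty sets should give 0.0. It returns a plain `f32`; wrapping that into a Series is left out because the exact call depends on the Polars version.

```rust
use std::collections::HashSet;

/// Hypothetical helper: Jaccard similarity between nullable row values and a word list.
fn jaccard_opt(row: &[Option<&str>], words: &[String]) -> f32 {
    // `flatten` drops the `None` entries, leaving plain `&str` values.
    let s1: HashSet<&str> = row.iter().copied().flatten().collect();
    // Borrow each `String` as `&str` so both sets share one element type.
    let s2: HashSet<&str> = words.iter().map(String::as_str).collect();

    let union = s1.union(&s2).count();
    if union == 0 {
        // Assumption: treat two empty sets as having similarity 0.0.
        return 0.0;
    }
    s1.intersection(&s2).count() as f32 / union as f32
}
```

Because both sets now hold `&str`, `intersection` and `union` type-check, and the `String` values in `words` are only borrowed, not cloned.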