
I am trying to recreate a Jaccard similarity function from Python in Rust with Polars. The function compares two lists and returns a float: the size of their intersection divided by the size of their union.

Python method

def jacc_sim(x, y):
    i_c = len(set.intersection(*[set(x), set(y)]))
    u_c = len(set.union(*[set(x), set(y)]))
    return i_c / float(u_c)

The goal is to compare the words in a list against the strings in a column of a df.

my_list = ["a", "ab", "abc", "abcd"]

df = pl.DataFrame({"keys" : ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"]})

Change the strings in the df to lists:

out = df.with_columns(pl.col("keys").apply(lambda x: x.split(" ")).alias("c"))

Apply a lambda with the jacc_sim function:

res = out.select([pl.col("keys"), pl.col("c").apply(lambda x: jacc_sim(x, my_list))
       .alias("d")]).sort("d", reverse=True)

Which gives the desired result in column "d":

shape: (4, 2)
┌────────────────────┬──────────┐
│ keys               ┆ d        │
│ ---                ┆ ---      │
│ str                ┆ f64      │
╞════════════════════╪══════════╡
│ a ab abc           ┆ 0.75     │
│ a ab               ┆ 0.5      │
│ b ba abc abcd      ┆ 0.333333 │
│ b ba bbc abcd bbcd ┆ 0.125    │
└────────────────────┴──────────┘

Rust method

This is more complicated. At first I couldn't find a way to use lists, but then I came across HashSets:

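use std::collections::HashSet;
use std::hash::Hash;

// Jaccard similarity: size of the intersection divided by size of the union.
fn jaccard<T>(s1: HashSet<T>, s2: HashSet<T>) -> f32 where T: Hash + Eq {
    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    i / u
}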
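As a point of comparison, here is a minimal sketch of the same calculation that borrows both sets instead of consuming them, so they can be reused afterwards. The name `jaccard_ref` is hypothetical, and returning 0.0 when both sets are empty is an assumption (the Python version would raise a division-by-zero error in that case).

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Jaccard similarity of two borrowed sets.
/// Assumption: two empty sets score 0.0 rather than NaN.
fn jaccard_ref<T: Hash + Eq>(s1: &HashSet<T>, s2: &HashSet<T>) -> f64 {
    let i = s1.intersection(s2).count() as f64;
    let u = s1.union(s2).count() as f64;
    if u == 0.0 {
        0.0
    } else {
        i / u
    }
}
```

Using f64 here mirrors the Python float; the f32 version above works the same way with less precision.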

And a function to create a HashSet from a vector:

fn vec_to_set(vec: &[String]) -> HashSet<String> {
    vec.iter().cloned().collect()
}
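The conversion above clones every String. If the vector outlives the set, one possible alternative is to borrow the strings instead. This is only a sketch, and `slice_to_set` is a hypothetical name:

```rust
use std::collections::HashSet;

/// Builds a set of string slices that borrow from `words`, without cloning.
/// The returned set cannot outlive the slice it was built from.
fn slice_to_set(words: &[String]) -> HashSet<&str> {
    words.iter().map(String::as_str).collect()
}
```

A `HashSet<&str>` built this way can be compared directly against another `HashSet<&str>`, for example one collected from `str::split`.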

So the first bit will be:

let words = vec!["a", "ab", "abc", "abcd"];

let df = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]?;

let out = df.lazy().with_column(col("keys").str().split(" "));
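For checking the numbers independently of Polars, the whole Python pipeline (split each key, score it against the word list, sort descending) can be sketched with the standard library alone. This is not the Polars solution being asked for; `rank_keys` is a hypothetical helper, and it assumes the word list is non-empty so the union is never empty.

```rust
use std::collections::HashSet;

/// Scores each key against `words` and returns (key, score) pairs,
/// sorted by descending score. Plain std only, no Polars involved.
fn rank_keys<'a>(keys: &[&'a str], words: &[&str]) -> Vec<(&'a str, f64)> {
    let word_set: HashSet<&str> = words.iter().copied().collect();
    let mut scored: Vec<(&'a str, f64)> = keys
        .iter()
        .map(|&key| {
            // Same step as the Python `x.split(" ")`.
            let key_set: HashSet<&str> = key.split(' ').collect();
            let i = key_set.intersection(&word_set).count() as f64;
            let u = key_set.union(&word_set).count() as f64;
            (key, i / u)
        })
        .collect();
    // Highest similarity first, like `sort("d", reverse=True)`.
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored
}
```

Called with the four keys and the four words from the question, this should give the same ordering and scores as column "d" in the Python output.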

UPDATE: Thanks for the general comments, which I frankly don't understand. I have combined my attempt with Ritchie's pyo3 example to write a fn to use as a closure in the df. How do I create HashSets of the same type? s1 is HashSet<Option<&str>> and s2 is HashSet<String, {unknown}, {unknown}>. And would the result produce a Series?

fn jaccard_similarity(sa: &Series, sb: Vec<String>) -> Series {
    let s1 = sa.utf8()
                .unwrap()
                .into_iter()
                .collect::<HashSet<_>>();
            
    let s2 = HashSet::from_iter(sb);

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result =  i / u;

    let s = result.collect::<Float32Chunked>().into_series();

}
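On the type mismatch specifically, one way the two sets might be brought to the same element type is to wrap each word in `Some` and borrow it as `&str`. The sketch below uses only the standard library: the `column` slice is a stand-in for whatever the string column's iterator yields (assumed to be `Option<&str>`, as reported above), and `jaccard_optional` is a hypothetical name. It returns a plain f32 and does not attempt to build a Series.

```rust
use std::collections::HashSet;

/// Jaccard similarity between optional column values and a word list.
/// Assumption: `column` models values of type Option<&str>, where None is a missing value.
fn jaccard_optional(column: &[Option<&str>], words: &[String]) -> f32 {
    let s1: HashSet<Option<&str>> = column.iter().copied().collect();
    // Borrow each String as &str and wrap it in Some so both sets share one type.
    let s2: HashSet<Option<&str>> = words.iter().map(|w| Some(w.as_str())).collect();

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    // Note: yields NaN if both inputs are empty.
    i / u
}
```

With this approach a missing value (None) counts as its own element of the union, which may or may not be the desired behaviour.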
fvg
  • https://github.com/pola-rs/pyo3-polars/blob/main/example/extend_polars/src/parallel_jaccard_mod.rs – jqurious Mar 09 '23 at 14:25
  • Very interesting. I see Ritchie uses let s1 = a.into_iter().collect::>(); How does that work with a series of list[str]? – fvg Mar 09 '23 at 14:40
  • Are you married to porting the function to Rust? It seems you'd be better off just doing that work as expressions rather than as a Python function. – Dean MacGregor Mar 10 '23 at 18:39
  • Not sure what you mean. I am attempting to learn both Rust and Polars with recreating an example that I got working (and use) in Python. – fvg Mar 10 '23 at 21:30
  • @fvg in Rust you can put vectors in sets because... why not? In Python it wouldn't work because you might modify a list key, which would “misplace” it. But in Rust, Sets and Maps don't provide mutable access to their keys, and the borrow checker will prevent you from trying to sneakily store mutable references. (Yes, you can get around this with interior mutability, but... don't.) So there is no reason to *not* make `Vec: Hash`. – BallpointBen Mar 19 '23 at 01:29
