
I'm basing my question on this previous one, which was missing sample data and desired results:

I want to write a UDF in Snowflake that can compute Jaccard similarity between two arrays:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

select jaccard_sim(a, b)
from data

The desired results are 0.6 and 0.5, for the previous two examples.
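That is, [1,2,3,4] and [1,2,3,5] share 3 distinct elements out of the 5 in their union (3/5 = 0.6), and [20,30,90] and [20,40,90] share 2 out of 4 (2/4 = 0.5).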

Definition: https://en.wikipedia.org/wiki/Jaccard_index

Felipe Hoffa

1 Answer


I wrote a JS UDF to perform the desired computation:

create or replace function jaccard_sim(A array, B array)
returns float
language javascript
as $$
// Jaccard similarity = |A ∩ B| / |A ∪ B|, computed over the distinct elements.
const setA = new Set(A);
const setB = new Set(B);

// Union: distinct elements appearing in either array.
const union = new Set([...setA, ...setB]).size;

// Intersection: distinct elements of A that also appear in B.
const intersection = [...setA].filter(x => setB.has(x)).length;

return intersection / union;
$$;

With this, select jaccard_sim(a, b) from data will work as expected.
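For example, running the query from the question against this UDF should produce the desired results:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

select jaccard_sim(a, b)
from data;

-- 0.6
-- 0.5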

I got the set operations for JS from https://exploringjs.com/impatient-js/ch_sets.html#union-a-b.


The UDF above solves the problem. As a bonus, this is how Snowflake's native approximate_similarity / approximate_jaccard_index can be used for the same task:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)


select approximate_similarity(mh), seq, array_agg(arr)
from (
    select minhash(1023, value) mh, seq, any_value(a) arr
    from data, table(flatten(a))
    group by seq
    union all
    select minhash(1023, value) mh, seq, any_value(b) arr
    from data, table(flatten(b))
    group by seq
)
group by seq
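For the sample data, the approximate similarities should come out close to the exact values of 0.6 and 0.5, since MINHASH-based similarity is an estimate of the true Jaccard index.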


Felipe Hoffa
  • When doing this en masse, it may be worth passing the rows through arrays_overlap() first. It is a far more efficient function that discards pairs of arrays that cannot be similar, which may reduce credit usage and speed things up. https://docs.snowflake.com/en/sql-reference/functions/arrays_overlap.html – Adrian White Jul 26 '22 at 08:55
  • I'm wondering if doing the extra check adds computation time in cases where the user expects most rows to have overlap – Felipe Hoffa Jul 26 '22 at 17:41
  • There is absolutely extra computation, as you know: it's detecting overlap vs. no overlap. I've found in most of my cases it's absolutely worth it (this also depends on the size of each array - my arrays were extremely large). The only thing I'd add is that data changes when you're not looking, so some sort of regular testing may be worthwhile - though your dataset's size, rate of change, etc. come into play. – Adrian White Jul 27 '22 at 01:30
  • @FelipeHoffa I ended up solving the problem by using array_intersection and calculating the union size as size(A) + size(B) - size(A ∩ B). It's not a great solution for very large arrays but it worked for my use case. – SriK Jul 27 '22 at 02:28
  • Oh, be careful with that solution, as you need to remove duplicates from the union – Felipe Hoffa Jul 27 '22 at 02:45
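For completeness, here is a rough pure-SQL sketch of the approach described in the comments above (it is not part of the original answer): deduplicate both arrays first so that size(A) + size(B) - size(A ∩ B) really is the size of the union, and optionally pre-filter with arrays_overlap(). It assumes array_distinct, array_intersection, array_size and arrays_overlap are available in your account:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

select array_size(array_intersection(da, db))
       / (array_size(da) + array_size(db) - array_size(array_intersection(da, db))) as jaccard
from (
    select array_distinct(a) da, array_distinct(b) db
    from data
    -- optional pre-filter: drop pairs that share no elements at all
    where arrays_overlap(a, b)
)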