I have multiple nodes in my Neo4j graph. I want to create a relationship between any two nodes if and only if the Jaccard similarity of their attributes is above some threshold alpha.

Consider 3 nodes:

Node 1: {id:1, abc: 1.1, eww: -9.4, ssv: "likj"}
Node 2: {id:2, we2: 1, eww: 900}
Node 3: {id:3, kuku: -91, lulu: 383, ssv: "bubu"}

So the Jaccard similarity of Node 1 and Node 2 on their attributes would be: (intersection = {id, eww} =) 2 / (union = {id, abc, eww, ssv, we2} =) 5 = 0.4
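As a quick sanity check of this arithmetic, here is a minimal sketch that treats the two nodes as plain map literals, so it should run without any data in the graph. The names `n1`, `n2`, `k1`, `k2` and `shared` are only illustrative, and the union size is obtained as |A| + |B| - |A ∩ B| rather than by building the union list.

```cypher
// Node 1 and Node 2 written as map literals (no graph data needed)
WITH {id: 1, abc: 1.1, eww: -9.4, ssv: "likj"} AS n1,
     {id: 2, we2: 1, eww: 900} AS n2
// keys() returns the property names of a map (or of a node)
WITH keys(n1) AS k1, keys(n2) AS k2
// Intersection: names that appear in both key lists
WITH k1, k2, [k IN k1 WHERE k IN k2] AS shared
// Union size = |k1| + |k2| - |shared| = 4 + 3 - 2 = 5
RETURN shared,
       1.0 * size(shared) / (size(k1) + size(k2) - size(shared)) AS jaccard
```

`shared` should contain `id` and `eww` (list order may vary), and `jaccard` should come out as 0.4.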

How can I do this in Neo4j? I know there is a Jaccard similarity function, but how do I configure it to work on the attributes of the nodes?

SteveS
  • Just for clarity, you mean the Jaccard similarity based on the presence of those attributes, not their values, right? – Simon Thordal Nov 09 '21 at 08:12
  • @SimonThordal yep, I need the Jaccard similarity on the attributes and if it's above a threshold then draw a relationship. – SteveS Nov 09 '21 at 13:40

1 Answer

Assuming you mean the Jaccard similarity of the presence of properties, you could do something like this:

MATCH (a:Node)
MATCH (b:Node) WHERE id(b) > id(a)
WITH a, b, [prop IN keys(a) WHERE prop IN keys(b)] AS shared_properties // Find the properties that exist on both nodes using the IN operator
WITH a, b, size(shared_properties) AS shared_property_count // Get the number of shared properties 
WITH 1.0*shared_property_count / size(apoc.coll.union(keys(a), keys(b))) AS jaccard_similarity, a, b // Compute the Jaccard similarity as the intersection over the union
WHERE jaccard_similarity > $threshold // Make sure the similarity is higher than some threshold
CREATE (a)-[:SIMILAR_TO {jaccard: jaccard_similarity}]->(b) 

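The first two WITH statements find the properties that are present on both nodes and count them. The third divides that count by the size of the union of both nodes' property names to get the Jaccard similarity, and the WHERE clause keeps only pairs above the threshold.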
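If APOC is not installed, a variant of the query above can avoid `apoc.coll.union` altogether. This is only a sketch under the same assumptions as the original query (a `Node` label and a `$threshold` parameter): because property names are unique within a node, the size of the union equals `size(keys(a)) + size(keys(b))` minus the number of shared properties.

```cypher
MATCH (a:Node)
MATCH (b:Node) WHERE id(b) > id(a)   // visit each unordered pair once
// Intersection size: property names present on both nodes
WITH a, b, size([prop IN keys(a) WHERE prop IN keys(b)]) AS shared_property_count
// Union size by inclusion-exclusion: |A| + |B| - |A ∩ B|
WITH a, b,
     1.0 * shared_property_count
       / (size(keys(a)) + size(keys(b)) - shared_property_count) AS jaccard_similarity
WHERE jaccard_similarity > $threshold
CREATE (a)-[:SIMILAR_TO {jaccard: jaccard_similarity}]->(b)
```

Note that, as in the question's own example, a property such as `id` that exists on every node counts toward both the intersection and the union.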

SteveS
Simon Thordal
  • I have tried it, but it didn't work for me; I'm still playing with it. Can you please explain the solution? – SteveS Nov 09 '21 at 13:38
  • I've added a few explanatory comments. In what way doesn't it work at the moment? – Simon Thordal Nov 09 '21 at 13:46
  • Why don't you apply a union over the keys of nodes a and b in `WITH 1.0*shared_property_count / size(keys(a)) AS jaccard_similarity, a, b // Compute the Jaccard similarity as the intersection over the union`? In Jaccard, the denominator is the union of the attributes of nodes a and b. @simon-thordal – SteveS Nov 09 '21 at 13:52
  • And why is it `AS jaccard_similarity, a, b`? Why the `a`, `b`? – SteveS Nov 09 '21 at 13:54
  • You're right about the union, I was missing that. It is easy enough to fix: you can get a list of the properties on both as `keys(a) + keys(b)`, and you will just need to make that list unique. – Simon Thordal Nov 09 '21 at 13:57
  • The reason we're carrying over the `a, b` in the `WITH` statement is so that we can create the relationship on the last line. If we didn't, the variables would go out of scope. – Simon Thordal Nov 09 '21 at 13:58
  • Isn't there a built-in APOC function to do this for every pair of nodes in the graph? It seems like a classic function to have. – SteveS Nov 10 '21 at 08:32
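The union fix discussed in the comments (concatenate `keys(a) + keys(b)`, then make the list unique) can be sketched in plain Cypher as follows. The `Node` label comes from the answer and the `id` property from the question's example nodes; `all_properties` and `union_size` are illustrative names.

```cypher
MATCH (a:Node), (b:Node)
WHERE id(a) < id(b)                      // each unordered pair once
// Concatenate both key lists, then emit one row per property name
UNWIND keys(a) + keys(b) AS key
// collect(DISTINCT ...) drops the names that appear on both nodes
WITH a, b, collect(DISTINCT key) AS all_properties
RETURN a.id, b.id, all_properties, size(all_properties) AS union_size
```

For Node 1 and Node 2 from the question, `union_size` should be 5. The answer's `apoc.coll.union(keys(a), keys(b))` produces the same distinct list in a single call. One caveat of this sketch: a pair where neither node has any properties yields no rows, because `UNWIND` over an empty list produces nothing.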