I'm building a similarity graph in Neo4j and gds.nodeSimilarity.stats is reporting a mean similarity score in the 0.60 to 0.85 range for the projection I'm using regardless of how I transform the graph. I've tried:
- Only projecting relationships with edge weights greater than 1
- Deleting the core node to increase the number of components (my graph is about a single topic, with the core node representing that topic)
- Changing it to an undirected graph
I realize I can always set the similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself since all the toy problems I used for training, including Neo4j's practices, had mean Jaccard scores less than 0.5. Am I overthinking this or is it a sign that something is wrong?
*** EDITED TO ADD *** This is a graph that has two types of nodes: Posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:
(e1 {Type:TwitterAccount})-[TWEETED]->(p:Post {Type:Tweet})-[AT_MENTIONED]->(e2 {Type:TwitterAccount})
(e1 {Type:TwitterAccount})-[TWEETED]->(p2:Post {Type:Tweet})-[QUOTE_TWEETED]->(p2:Post {Type:Tweet})-[AT_MENTIONED]->(e2 {Type:TwitterAccount})
For my code, I've tried first projecting only AT_MENTIONED relationships:
- CALL gds.graph.create('similarity_graph', ["Entity", "Post"], "AT_MENTIONED")
I've tried doing that with a reversed orientation:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], {AT_MENTIONED:{type:'AT_MENTIONED', orientation:'REVERSE'}})
I've tried creating a monopartite, weighted relationship between all the nodes with a RELATED_TO relationship ...
MATCH (e1:Entity)-[*2..3]->(e2:Entity) WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2) WITH e1, e2, count(*) as strength MERGE (e1)-[r:RELATED_TO]->(e2) SET r.strength
= strength
...and then projecting that:
CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")
Whichever one of the above I try, I then get my Jaccard distribution by running:
CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution