Any way to filter out the most frequent terms in Neo4J APOC request?

Question

I have the following request:

CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel, start, end
WITH DISTINCT rel, start, end
RETURN DISTINCT
  start.uid AS source_id,
  start.name AS source_name,
  end.uid AS target_id,
  end.name AS target_name,
  rel.uid AS edge_id,
  rel.context AS context_id,
  rel.statement AS statement_id,
  rel.weight AS weight

It returns a table of results with one row per relationship.

The question: Is there a way to filter out the top 150 most connected nodes (the source_name/source_id and target_name/target_id nodes)?

I don't think counting row frequency would work, as each table row is unique (because of the different edge_id), but maybe there's a function in Neo4j / Cypher that lets me count the most frequent (source_name/source_id and target_name/target_id) nodes?

Thank you!

Are you looking for the 50 most common `start/end` node pairs? — cybersam, Jan 14 '19 at 21:24
@cybersam yes, exactly. the 50 or 150 most frequently occurring pairs. — Aerodynamika, Jan 15 '19 at 03:06

score 1 · Answer 1 · answered Jan 14 '19 at 01:00

You could always use size( (node)-[:REL]->() ) to get the degree.

And if you compute the top-n degrees first, you can filter those out by comparing

WHERE min < size( (node)-[:REL]->() ) < max
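
One way to fold this degree test into the query from the question is sketched below. It assumes that the relationships are of type `TO` (the thread only shows an index named 'TO'), that total degree in both directions is what matters, and that the cut-off arrives as a hypothetical `$maxDegree` parameter, for example the degree of the 150th most connected node computed beforehand. Note that `size()` over a pattern counts every `TO` relationship of a node, not only those in the indexed context.

```cypher
CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel, start, end
WITH DISTINCT rel, start, end
// Assumption: the relationship type is :TO; $maxDegree is a hypothetical parameter.
// Keep a row only if neither endpoint reaches the degree cut-off.
WHERE size((start)-[:TO]-()) < $maxDegree
  AND size((end)-[:TO]-()) < $maxDegree
RETURN
  start.uid AS source_id,
  start.name AS source_name,
  end.uid AS target_id,
  end.name AS target_name,
  rel.uid AS edge_id,
  rel.context AS context_id,
  rel.statement AS statement_id,
  rel.weight AS weight
```

The `WHERE` has to follow a `WITH` (here the existing `WITH DISTINCT`), and it tests the nodes rather than `rel`, since degree is a property of a node.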

Michael Hunger

Thank you, Michael. But how do I integrate it into my query? Can I add `WHERE size(rel)` right after the `WITH DISTINCT` part of the query? And I'm still not clear on how to filter out the top 150 ones... It would be great if you could clarify that! – Aerodynamika Jan 14 '19 at 10:54

cybersam · Accepted Answer · 2019-01-15T18:47:04.280

This query might do what you want:

CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0') 
YIELD rel, start, end
WITH start, end, COLLECT(rel) AS rs
ORDER BY SIZE(rs) DESC LIMIT 50
RETURN
  start.uid AS source_id, 
  start.name AS source_name, 
  end.uid AS target_id, 
  end.name AS target_name,
  [r IN rs | {edge_id: r.uid, context_id: r.context, statement_id: r.statement, weight: r.weight}] AS rels

The query uses the aggregating function COLLECT to collect all the relationships for each pair of start/end nodes, keeps the data for the 50 node pairs with the most relationships, and returns a row of data for each pair (with the data for the relationships in a rels list).
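
The query above ranks start/end pairs. If the aim is instead to drop the most connected nodes, the same kind of aggregation can be done per node. The following is a rough, untested sketch under these assumptions: degree is counted only among the relationships returned by the index lookup, both directions count, and 150 is the cut-off taken from the question. The names `hubs`, `source`, and `target` are introduced here for illustration.

```cypher
CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel
WITH COLLECT(DISTINCT rel) AS rels
// Count how often each node appears as an endpoint,
// i.e. its degree within this result set.
UNWIND rels AS r
UNWIND [startNode(r), endNode(r)] AS n
WITH rels, n, COUNT(*) AS degree
ORDER BY degree DESC LIMIT 150
// Gather the 150 most connected nodes into one list.
WITH rels, COLLECT(n) AS hubs
// Keep only the relationships that touch none of those nodes.
UNWIND rels AS r
WITH r, startNode(r) AS source, endNode(r) AS target, hubs
WHERE NOT (source IN hubs) AND NOT (target IN hubs)
RETURN
  source.uid AS source_id,
  source.name AS source_name,
  target.uid AS target_id,
  target.name AS target_name,
  r.uid AS edge_id,
  r.context AS context_id,
  r.statement AS statement_id,
  r.weight AS weight
```

Replacing `COUNT(*)` with `SUM(r.weight)` would rank nodes by weighted degree instead, assuming `weight` is numeric. If the result set holds 150 nodes or fewer, every relationship is filtered out.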

edited Jan 15 '19 at 18:47

answered Jan 15 '19 at 18:30

Thank you, it works! Do I understand correctly that it counts how many `rels` share the same `start` and `end` node pair and then filters out the top 50 of them? Can I somehow integrate the `weight` parameter of the relationship `rel.weight` into this calculation? – Aerodynamika Jan 25 '19 at 21:24
The only problem is that it filters out the top 150 relationships, so that's not necessarily 150 nodes and my graph is directed... I wonder how it could be possible to filter out the top 150 nodes with the highest degree from those results and then show the relationships for them... – Aerodynamika Jan 25 '19 at 21:34
