I've been advised to use apoc.periodic.commit to batch up a large query I'm running in Neo4j. My code below doesn't seem to be batching and committing after each step, though. The server runs out of memory, which I don't think it should if it commits after each single item.
I'm calculating the Jaccard indexes of a set of nodes (here I name the property paradig, for "paradigmatic relation", since this is a set of next-word relationships in text).
Computing this for each node is quite a large job. I'm calculating for 53 nodes, but the whole population is about 60k, and this is an n^2 operation. If I run it in a single transaction I run out of memory. So I want to run it in batches, committing once each index has been calculated. I have marked the nodes I need to process with a property toProcess, and I'm running the code below to compute the Jaccard indexes.
1) Am I just using APOC wrong?
2) Is there a better, more Neo4j-centric way of doing this? I have always worked with SQL.
call apoc.periodic.commit("
MATCH (s:Word{toProcess: True})
MATCH (w:Word)-[:NEXT_WORD]->(s)
WITH collect(DISTINCT w.name) as left1, s
MATCH (w:Word)<-[:NEXT_WORD]-(s)
WITH left1, s, collect(DISTINCT w.name) as right1
// Match every other word
MATCH (o:Word) WHERE NOT s = o
WITH left1, right1, s, o
// Get other right, other left1
MATCH (w:Word)-[:NEXT_WORD]->(o)
WITH collect(DISTINCT w.name) as left1_o, s, o, right1, left1
MATCH (w:Word)<-[:NEXT_WORD]-(o)
WITH left1_o, s, o, right1, left1, collect(DISTINCT w.name) as right1_o
// compute right1 union, intersect
WITH FILTER(x IN right1 WHERE x IN right1_o) as r1_intersect,
(right1 + right1_o) AS r1_union, s, o, right1, left1, right1_o, left1_o
// compute left1 union, intersect
WITH FILTER(x IN left1 WHERE x IN left1_o) as l1_intersect,
(left1 + left1_o) AS l1_union, r1_intersect, r1_union, s, o
WITH DISTINCT r1_union as r1_union, l1_union as l1_union, r1_intersect, l1_intersect, s, o
WITH 1.0*size(r1_intersect) / size(r1_union) as r1_jaccard,
1.0*size(l1_intersect) / size(l1_union) as l1_jaccard,
s, o
WITH s, o, r1_jaccard, l1_jaccard, r1_jaccard + l1_jaccard as sim
MERGE (s)-[r:RELATED_TO]->(o) SET r.paradig = sim
set s.toProcess = false
",{batchSize:1, parallel:false})
Rationale:
batchSize:1 - I want it to commit after each Jaccard index is set.
parallel:false - I want serial operation so I don't run out of memory.
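
For reference, the similarity the query above is meant to produce can be sketched outside the database. This is only a minimal Python sketch, assuming each word's left and right neighbours are already available as sets of names; the function names `jaccard` and `paradigmatic_similarity` are hypothetical and not part of Neo4j or APOC. Note that it uses a true set union, whereas the query concatenates the two lists, so the numbers may differ from what the Cypher produces.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, taken as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two words with no neighbours on this side score 0.0
        return 0.0
    return len(a & b) / len(union)


def paradigmatic_similarity(left_s, right_s, left_o, right_o):
    """Sum of the left-neighbour and right-neighbour Jaccard indexes
    for a word s and another word o (the value stored as r.paradig)."""
    return jaccard(left_s, left_o) + jaccard(right_s, right_o)


# Example with made-up neighbour sets for two words
sim = paradigmatic_similarity(
    left_s={"the", "a"}, right_s={"sat", "ran"},
    left_o={"the"}, right_o={"sat", "slept"},
)
print(sim)
```

In this example the left sets share one of two distinct names and the right sets share one of three, so the result is the sum of those two ratios.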