I've been advised to use apoc.periodic.commit to batch up a large query I'm running in neo4j. My code below doesn't seem to be batching and committing after each step, though: the server runs out of memory, which I don't think should happen if it commits after each single item.

I'm calculating the Jaccard indexes of a set of nodes (I name the property paradig, for "paradigmatic relation", since this is a set of next-word relationships in text).

Computing this for each node is quite a large job. I'm calculating for 53 nodes, but the whole population is about 60k, and this is an n^2 operation. If I run it in a single transaction I run out of memory, so I want to run it in batches, committing once each index has been calculated. I have marked the nodes I need to process with a property toProcess, and I'm running the code below to compute the Jaccard indexes.
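
A small worked example may make the quantity concrete. The Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below uses made-up name lists in place of the collected neighbours, and assumes APOC's apoc.coll.union is available to build a duplicate-free union. Note that the plain + operator on lists only concatenates, so shared names would be counted twice in the denominator.

```cypher
// Toy lists standing in for the collected neighbour names (illustrative values only).
WITH ['the', 'a', 'my'] AS left1, ['a', 'my', 'his', 'her'] AS left1_o
// Intersection via list comprehension; union via APOC (assumed to be installed).
WITH [x IN left1 WHERE x IN left1_o] AS l1_intersect,
     apoc.coll.union(left1, left1_o) AS l1_union
RETURN l1_intersect, l1_union,
       1.0 * size(l1_intersect) / size(l1_union) AS l1_jaccard
```

Here two names are shared out of five distinct ones, so l1_jaccard comes out as 0.4. With left1 + left1_o as the denominator it would be 2 / 7 instead.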

1) Am I just using apoc wrong?

2) Is there a better, more neo4j-centric way of doing this? I have always worked with SQL.

call apoc.periodic.commit("
MATCH (s:Word{toProcess: True})
MATCH (w:Word)-[:NEXT_WORD]->(s)
WITH collect(DISTINCT w.name) as left1, s
MATCH (w:Word)<-[:NEXT_WORD]-(s)
WITH left1, s, collect(DISTINCT w.name) as right1
// Match every other word
MATCH (o:Word) WHERE NOT s = o
WITH left1, right1, s, o
// Get other right, other left1
MATCH (w:Word)-[:NEXT_WORD]->(o)
WITH collect(DISTINCT w.name) as left1_o, s, o, right1, left1
MATCH (w:Word)<-[:NEXT_WORD]-(o)
WITH left1_o, s, o, right1, left1, collect(DISTINCT w.name) as right1_o
// compute right1 union, intersect
WITH FILTER(x IN right1 WHERE x IN right1_o) as r1_intersect,
  (right1 + right1_o) AS r1_union, s, o, right1, left1, right1_o, left1_o
// compute left1 union, intersect
WITH FILTER(x IN left1 WHERE x IN left1_o) as l1_intersect,
  (left1 + left1_o) AS l1_union, r1_intersect, r1_union, s, o
WITH DISTINCT r1_union as r1_union, l1_union as l1_union, r1_intersect, l1_intersect, s, o
WITH 1.0*size(r1_intersect) / size(r1_union) as r1_jaccard,
  1.0*size(l1_intersect) / size(l1_union) as l1_jaccard,
  s, o
WITH s, o, r1_jaccard, l1_jaccard, r1_jaccard + l1_jaccard as sim
MERGE (s)-[r:RELATED_TO]->(o) SET r.paradig = sim
set s.toProcess = false
",{batchSize:1, parallel:false})

rationale:

batchSize:1: I want it to commit after each jaccard index is set

parallel:false: I want serial operation so I don't run out of memory

Ashley Mills
DanBennett
    This does not apply to the topic, and I can not compare it now with your query, but it's a snippet for calculating the jaccard that I use: https://pastebin.com/raw/9tX2axxD / Maybe it will be useful to you)) – stdob-- May 04 '18 at 19:07
  • thanks - looks easier to follow than mine :) – DanBennett May 05 '18 at 10:16

1 Answer

I've got this working using apoc.periodic.iterate rather than apoc.periodic.commit, as below.

I have marked this as the correct answer because a fair amount of time has passed since asking. I'm not convinced there isn't a better way, though.

I found it difficult to find out best practice for batching updates like this in neo4j, and I'm not enough of an expert myself to know whether this is best (or even halfway decent) practice.

call apoc.periodic.iterate("
MATCH (s:Word) WHERE s.toProcess = true
RETURN s",
"MATCH (w:Word)-[:NEXT_WORD]->(s)
WITH collect(DISTINCT w.name) as left1, s
MATCH (w:Word)<-[:NEXT_WORD]-(s)
WITH left1, s, collect(DISTINCT w.name) as right1
// Match every other word
MATCH (o:Word) WHERE NOT s = o
WITH left1, right1, s, o
// Get other right, other left1
MATCH (w:Word)-[:NEXT_WORD]->(o)
WITH collect(DISTINCT w.name) as left1_o, s, o, right1, left1
MATCH (w:Word)<-[:NEXT_WORD]-(o)
WITH left1_o, s, o, right1, left1, collect(DISTINCT w.name) as right1_o
// compute right1 union, intersect (apoc.coll.union de-duplicates; right1 + right1_o would only concatenate)
WITH [x IN right1 WHERE x IN right1_o] as r1_intersect,
  apoc.coll.union(right1, right1_o) AS r1_union, s, o, right1, left1, right1_o, left1_o
// compute left1 union, intersect
WITH [x IN left1 WHERE x IN left1_o] as l1_intersect,
  apoc.coll.union(left1, left1_o) AS l1_union, r1_intersect, r1_union, s, o
WITH DISTINCT r1_union as r1_union, l1_union as l1_union, r1_intersect, l1_intersect, s, o
WITH 1.0*size(r1_intersect) / size(r1_union) as r1_jaccard,
  1.0*size(l1_intersect) / size(l1_union) as l1_jaccard,
  s, o
WITH s, o, r1_jaccard, l1_jaccard, r1_jaccard + l1_jaccard as sim
MERGE (s)-[r:RELATED_TO]->(o) SET r.paradig = sim
set s.toProcess = false",
{batchSize:1})
YIELD batches, total
RETURN batches, total
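
For comparison, here is a sketch of the shape apoc.periodic.commit expects, which may explain why the original call did not batch. The procedure re-runs its statement in a fresh transaction until the statement returns 0, so the statement needs a LIMIT bound to the limit parameter and a RETURN of the number of rows handled. Its second argument is a plain parameter map; batchSize and parallel are configuration keys of apoc.periodic.iterate. The Jaccard computation is left out as a placeholder comment, and the limit of 1 is only an illustrative choice.

```cypher
CALL apoc.periodic.commit("
  MATCH (s:Word {toProcess: true})
  // Take at most $limit unprocessed words per transaction.
  WITH s LIMIT $limit
  // ... the per-word Jaccard computation and MERGE would go here ...
  SET s.toProcess = false
  // A returned 0 tells apoc.periodic.commit to stop looping.
  RETURN count(*)
", {limit: 1})
```

Older Neo4j versions write the parameter as {limit} rather than $limit. One caveat if the full computation is dropped in: a word whose MATCH clauses yield no rows would never reach the SET, so it would stay marked and the loop might not terminate.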
DanBennett
    3 years later and it's still very difficult to find clear answers. Thanks for attempting this. I'm doing something similar, except the data set is ~4 million nodes, so... :| – p e p Nov 09 '21 at 00:27