
I have a problem creating triadic closures for a huge number of nodes and relationships. I have been searching for an answer for hours, but nothing really matched my problem.


The dataset:

  • 322276 nodes with label PERSON (with an index on the property name)
  • 987052 nodes with label PRODUCTION
  • 6417928 relationships of type PLAYS
  • 14314487 relationships of type WORKS

The nodes are connected as follows:

  • (:PERSON)-[:PLAYS]->(:PRODUCTION)
  • (:PERSON)-[:WORKS]->(:PRODUCTION)

I want to create triadic closures between all persons, i.e. connect any two persons who worked on or played in the same production with a new relationship of type [:WORKED_WITH]. To do so I wrote the following query:

MATCH (p1:PERSON)-[:WORKS|PLAYS*2..2]-(p2:PERSON)
WHERE p1 <> p2
CREATE UNIQUE (p1)-[:WORKED_WITH]->(p2);

Instead of CREATE UNIQUE I also tried MERGE, and I tried adding WHERE NOT (p1)-[:WORKED_WITH]->(p2). The problem is that even after 7 hours the query does not finish... I know this is a huge amount of data, but I hope there is a different way to do this much more quickly...
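
The MERGE variant looked roughly like this:

MATCH (p1:PERSON)-[:WORKS|PLAYS*2..2]-(p2:PERSON)
WHERE p1 <> p2 AND NOT (p1)-[:WORKED_WITH]->(p2)
MERGE (p1)-[:WORKED_WITH]->(p2);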

Do you have any idea what to do?

Some more information:


  • Neo4j 3.1.4 Community Edition
  • Windows 10
  • Quad Core i5
  • 8GB RAM DDR3
  • database located on an SSD
  • I did not change the default Neo4j config

I also thought about using the traversal API, but I don't know how to do this (and I don't know whether it would help)... I have already read books by Michael Hunger, Vukotic/Watt, Panzarino, etc., studied the official docs and read many answers on Stack Overflow, but did not find anything that solved this. I hope you can help me.


Best Wishes, Wolfgang

  • (1) When you used `MERGE (p1)-[:WORKED_WITH]->(p2)`, you should NOT have also used `WHERE NOT (p1)-[:WORKED_WITH]->(p2)`, since `MERGE` automatically does the same test, duplicating the effort. (2) Do you actually *need* the redundant `WORKED_WITH` relationship, which requires much more code complexity and storage in order to add those relationships whenever related changes are made to the DB? Couldn't you just use something like your existing `MATCH` and `WHERE` clauses to find who worked with whom as needed? – cybersam Jun 21 '17 at 05:40
  • Well, I also tried MERGE without that WHERE part (which also took many days). I do want to do things like clustering, and for that I need a similarity (e.g. Jaccard) for every pair of nodes... It just does not perform well if I don't create those edges (with a weight)... – Wolfgang G. Jun 21 '17 at 05:52

2 Answers

2

When refactoring or updating a big graph you want to use batching. The APOC library provides this with the apoc.periodic procedures.

In your example that would look like:

call apoc.periodic.commit("
  MATCH (p1:PERSON)-[:WORKS|PLAYS*2..2]-(p2:PERSON)
  WHERE id(p1) < id(p2) AND NOT (p1)-[:WORKED_WITH]-(p2)
  WITH p1, p2 LIMIT {limit}
  MERGE (p1)-[:WORKED_WITH]-(p2)
  RETURN count(*)
", {limit: 5000})
Tomaž Bratanič
  • Thank you very much! I read about the APOC library in an answer by Michael Hunger, but I did not know it can commit periodically. I have to admit that I thought about using APOC to export to CSV and import that using Cypher, but I thought that maybe there is a way provided by Neo4j itself... Am I right that there isn't? – Wolfgang G. Jun 01 '17 at 10:54
  • No, for now I haven't seen any good native Cypher batching options. `apoc.periodic.iterate` and `apoc.periodic.commit` fit nicely for batching – Tomaž Bratanič Jun 01 '17 at 10:58
  • Thank you very much for your support. I will try that and share my result. – Wolfgang G. Jun 01 '17 at 11:01
  • apoc.periodic.commit works fine for my issue, although it takes a lot of time (which is not surprising given the size of my graph). Thank you again! – Wolfgang G. Jun 02 '17 at 07:06
1

In case someone ever reads this question: using the APOC library did not solve my problem satisfactorily... It would have taken about a month.

So I decided to run a SQL statement to export the triadic closures to a CSV file (in Neo4j I had created unique constraints on the IDs from the SQL tables). That CSV data is easily imported via bulk loading. Altogether it took me less than 3 hours, so I would say this is the most efficient way to solve it.
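
The import can be done with LOAD CSV and periodic commits; roughly like this (the file name, the CSV headers and the sqlId property are only illustrative, they depend on your export):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///worked_with.csv" AS row
MATCH (p1:PERSON {sqlId: toInt(row.id1)})
MATCH (p2:PERSON {sqlId: toInt(row.id2)})
CREATE (p1)-[:WORKED_WITH]->(p2);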