
I have a huge database with a ton of nodes (10 million+). There is only one type of relationship in the whole database. However, many pairs of nodes have duplicated relationships between them. What I have currently is this Cypher query that finds all the pairs with duplicates, and then a Python script that runs through and cleans up each one (leaving just one unique relationship between those nodes).

match (a)-[r]->(b) with a,b, count(*) as c where c>1 return a.pageid, b.pageid, c LIMIT 100000;

This works fairly well for a small database, but when I run it on a big one it eventually blows up with an out-of-memory exception on the heap (moving to a bigger and bigger box doesn't help).

So, the question is two-fold: 1) Is there any sort of indexing I can put on relationships (right now there is none) that would help speed this up? 2) Is there a Cypher query that can quickly, or at least reliably, delete all the duplicate relationships in the database, leaving just one relationship for each node pair that already has a relationship between them?

P.S. I'm running Neo4j 2.0.1 on an Ubuntu (12.something) AWS box.

P.P.S. I realize there is this answer: stackoverflow. However, that question is more specific (it is about two already known nodes), and the answer there that covers the full database doesn't run anymore (syntax change?).

Thanks in advance!

Diaspar
  • 1
    Just some thoughts: Have you attempted smaller batches, maybe 100 at a time, passing those to your python script? (not sure you need to grab 100K per pass) Do you have indexes on your nodes, where you could run this operation against specific node types, reducing total node space? – David Makogon Apr 10 '14 at 18:58

2 Answers


What error do you get with the database-wide query in the linked SO question? Try substituting | for : in the FOREACH; that's the only breaking syntax difference that I can see. The 2.x way to say the same thing, adapted to your having only one relationship type in the db, might be

MATCH (a)-[r]->(b)
WITH a, b, TAIL (COLLECT (r)) as rr
FOREACH (r IN rr | DELETE r)

I think the WITH pipe will carry the empty tails when there is no duplicate, and I don't know how expensive it is to loop through an empty collection. My sense is that the place to introduce the limit is with a filter after the WITH, something like

MATCH (a)-[r]->(b)
WITH a, b, TAIL (COLLECT (r)) as rr
WHERE length(rr) > 0 LIMIT 100000
FOREACH (r IN rr | DELETE r)

Since this query doesn't touch properties at all (as opposed to yours, which returns properties for (a) and (b)) I don't think it should be very memory heavy for a medium graph like yours, but you will have to experiment with the limit.
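To make the "keep one, delete the rest" logic concrete, here is a minimal plain-Python model of what the filtered query does. It is only a sketch: the `(rel_id, start, end)` tuples and the function name `duplicate_tails` are made up for illustration and have nothing to do with how Neo4j stores relationships. Also note that Cypher does not promise any particular order inside COLLECT, so which relationship ends up "first" (and survives) is arbitrary.

```python
from collections import defaultdict

def duplicate_tails(relationships):
    """Group (rel_id, start, end) tuples by directed node pair and return,
    for each pair with duplicates, every relationship except the first."""
    groups = defaultdict(list)  # plays the role of COLLECT(r) per (a, b)
    for rel_id, start, end in relationships:
        groups[(start, end)].append(rel_id)
    # rels[1:] is the tail; pairs with a single relationship are filtered out,
    # like the WHERE on the length of the tail.
    return {pair: rels[1:] for pair, rels in groups.items() if len(rels) > 1}

# Hypothetical data: three a->b relationships, one b->c, one b->a.
rels = [(1, "a", "b"), (2, "a", "b"), (3, "a", "b"), (4, "b", "c"), (5, "b", "a")]
to_delete = duplicate_tails(rels)
```

Here only the pair `("a", "b")` has duplicates, so relationships 2 and 3 are marked for deletion and 1 is kept. The `b->a` relationship is left alone because the pattern is directed.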

If memory is still a problem, then if there is any way for you to limit the nodes to work with (without touching properties), that's also a good idea. If your nodes are distinguishable by label, try running the query for one label at a time:

MATCH (a:A)-[r]->(b) //etc..
MATCH (a:B)-[r]->(b) //etc..
jjaderberg
  • 1
    How does that FOREACH know to only kill the extras and leave 1 (like, if there are 3 relationships that are the same, leave 1 and kill 2)? Just trying to understand before I run this on the db (it took 2 weeks to import this data :-/). – Diaspar Apr 10 '14 at 20:02
  • 1
    Do test it first, just setup a mock db at http://console.neo4j.org or something. The reason is that you only carry the tail of the collection, i.e. all but the first, so the first will be untouched by the foreach. – jjaderberg Apr 10 '14 at 20:14
  • 2
    it worked!!!! previous efforts would linger for hours and eventually bomb out with a heap "out of memory" exception. this thing completed in 2.5 minutes!! – Diaspar Apr 10 '14 at 20:51
  • 1
    getting syntax errors with this a year later. Neo4j doesn't like the `LIMIT 100000` for some reason – Monica Heddneck Jul 29 '16 at 03:41

This is a version of the accepted answer that has been fixed (by inserting the WITH rr clause) to work with more recent neo4j versions, and which should be faster (since it only creates the new TAIL list when needed):

MATCH (a)-[r]->(b)
WITH a, b, COLLECT(r) AS rr
WHERE SIZE(rr) > 1
WITH rr
LIMIT 100000
FOREACH (r IN TAIL(rr) | DELETE r);
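Because of the LIMIT, one run only cleans up a bounded number of node pairs, so on a big graph you would presumably rerun the query until nothing is left to delete. Below is a rough, driver-agnostic sketch of that loop. `run_batch` is a hypothetical callable (not part of any Neo4j API) that you would implement to execute the query once and return how many relationships it deleted; `fake_batch` is just a stand-in so the loop can be tried without a database.

```python
def delete_in_batches(run_batch, max_rounds=1000):
    """Call run_batch() repeatedly until it reports zero deletions.

    run_batch is a hypothetical callable that runs the LIMITed query once
    and returns the number of relationships it deleted.
    """
    total = 0
    for _ in range(max_rounds):
        deleted = run_batch()
        if deleted == 0:
            return total
        total += deleted
    raise RuntimeError("still finding duplicates after max_rounds batches")

# Stand-in for a real database call: pretend 250 duplicates exist and
# each batch removes at most 100 of them.
remaining = [250]

def fake_batch(limit=100):
    n = min(limit, remaining[0])
    remaining[0] -= n
    return n

total_deleted = delete_in_batches(fake_batch)
```

The `max_rounds` cap is only there so a query that never makes progress cannot loop forever.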

[UPDATE]

If you only want to delete duplicate relationships with the same type, then do this:

MATCH (a)-[r]->(b)
WITH a, b, TYPE(r) AS t, COLLECT(r) AS rr
WHERE SIZE(rr) > 1
WITH rr
LIMIT 100000
FOREACH (r IN TAIL(rr) | DELETE r);
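The only change from the previous query is that the relationship type becomes part of the grouping key. A small plain-Python model of that grouping, as a sketch with made-up data (the tuples, the function name `duplicate_tails_by_type` and the type names `LINKS` and `CITES` are all hypothetical):

```python
from collections import defaultdict

def duplicate_tails_by_type(relationships):
    """Group (rel_id, start, end, rel_type) tuples by node pair AND type,
    returning every relationship except the first in each duplicated group."""
    groups = defaultdict(list)
    for rel_id, start, end, rel_type in relationships:
        # The type is part of the key, like TYPE(r) AS t in the WITH clause.
        groups[(start, end, rel_type)].append(rel_id)
    return {key: rels[1:] for key, rels in groups.items() if len(rels) > 1}

# Two LINKS relationships and one CITES relationship between the same pair.
typed_rels = [(1, "a", "b", "LINKS"), (2, "a", "b", "LINKS"), (3, "a", "b", "CITES")]
typed_to_delete = duplicate_tails_by_type(typed_rels)
```

Only the second `LINKS` relationship is marked for deletion; the `CITES` relationship survives because it is the only one of its type between that pair.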
Benjamin R
cybersam