I have a huge database with a ton of nodes (10 million+). There is only one type of relationship in the whole database. However, many pairs of nodes have duplicated relationships between them. What I currently have is this Cypher script that finds all the pairs with duplicates, and then a Python script that runs through and cleans up each one (leaving just one unique relationship between those nodes).
MATCH (a)-[r]->(b) WITH a, b, count(*) AS c WHERE c > 1 RETURN a.pageid, b.pageid, c LIMIT 100000;
This works fairly well for a small database, but when I run it on a big one it eventually blows up with an exception about running out of heap memory (bumping up the box size again and again doesn't help).
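For reference, the per-pair cleanup boils down to something like the minimal sketch below. The function name and the in-memory list of (start node id, end node id, relationship id) tuples are assumptions for illustration only; the real script talks to Neo4j rather than a Python list.

```python
def duplicate_rel_ids(rels):
    """Return the relationship ids that should be deleted.

    `rels` is an iterable of (start_id, end_id, rel_id) tuples
    (a hypothetical export format). The first relationship seen for
    each directed (start, end) pair is kept; later ones are duplicates.
    """
    seen = set()        # directed pairs that already have a kept relationship
    to_delete = []      # ids of the redundant relationships
    for start_id, end_id, rel_id in rels:
        pair = (start_id, end_id)
        if pair in seen:
            to_delete.append(rel_id)
        else:
            seen.add(pair)
    return to_delete


if __name__ == "__main__":
    sample = [(1, 2, 10), (1, 2, 11), (2, 1, 12), (1, 2, 13), (3, 4, 14)]
    print(duplicate_rel_ids(sample))
```

Direction matters here, as in the query above: (1)->(2) and (2)->(1) are treated as different pairs.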
So, the question is two-fold: 1) Is there any sort of indexing I can put on relationships (right now there is none) that would help speed this up? 2) Is there a Cypher query that can quickly (or at least reliably) delete all the duplicate relationships in the database, leaving just one unique relationship for each node pair that already has a relationship between them?
P.S. I'm running Neo4j 2.0.1 on an Ubuntu (12-something) AWS box.
P.P.S. I realize there is this answer: stackoverflow. However, that question is more specific (it is about two already-known nodes), and the answer that covers the full database doesn't run anymore (syntax change?).
Thanks in advance!