Suppose I have 3 subgraphs in Neo4j, and I would like to select and delete a whole subgraph if all of its nodes match the filtering criterion, namely that each node's property value is <= 1. However, if at least one node in the subgraph does not match the criterion, the subgraph should not be deleted.

In this case the left subgraph will be deleted, but the right and middle subgraphs will stay. The right one will not be deleted even though it has some nodes with value 1, because it also has nodes with values greater than 1.

userids and values are the node properties.

[Image: three disconnected subgraphs (left, middle, right); each node shows its userid and value]

I would be thankful if anyone could suggest a Cypher query to do that. Please note that the query will run on the whole graph, that is, on all three subgraphs, or more if there are any more.

sjishan

1 Answer

Thanks for the clarification. That's a tricky requirement, and it's not immediately clear to me which approach will scale best on large graphs, as most possibilities seem to be expensive full-graph operations. We'll likely need a few steps to set up the graph for easier querying later. I'm also assuming you mean "disconnected subgraphs"; otherwise this answer won't work.

One start might be to label nodes as :Alive or :Dead based on the property value. It should help if all nodes have the same label and there's an index on the value property for that label, as our match operations could then take advantage of the index instead of doing a full label scan and property comparison.

MATCH (a:MyNode)
WHERE a.value <= 1
SET a:Dead

And separately

MATCH (a:MyNode)
WHERE a.value > 1
SET a:Alive
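
For the index on the value property mentioned earlier, a minimal sketch follows. It assumes the :MyNode label and value property used in the labeling queries above. The index syntax depends on the Neo4j version, and the index name my_node_value in the commented alternative is only a hypothetical name chosen for illustration.

```cypher
// Older syntax (Neo4j 3.x): index on the value property of :MyNode nodes.
CREATE INDEX ON :MyNode(value)

// Newer syntax (Neo4j 4.x and later); run this instead of the statement above.
// The index name my_node_value is a hypothetical name.
// CREATE INDEX my_node_value FOR (n:MyNode) ON (n.value)
```

With the index in place, the WHERE a.value <= 1 and WHERE a.value > 1 filters can be answered by range lookups on the index rather than by scanning every :MyNode node.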

Then your query to mark nodes to delete would be:

MATCH (a:Dead)
WHERE NOT (a)-[*]-(:Alive)
SET a:ToDelete

And if all looks good with the nodes you've marked for deletion, you can run your delete operation, using apoc.periodic.commit() from APOC Procedures to batch the operation if necessary.

MATCH (a:ToDelete)
DETACH DELETE a
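
If the delete needs batching, a sketch using apoc.periodic.commit() might look like the following. It assumes APOC is installed; the batch size of 10000 is an arbitrary value chosen for illustration, and very old Neo4j versions would need the {limit} parameter syntax instead of $limit.

```cypher
// apoc.periodic.commit() re-runs the inner statement in separate transactions
// until it returns 0, so the statement needs a LIMIT and must return a count.
CALL apoc.periodic.commit(
  "MATCH (a:ToDelete)
   WITH a LIMIT $limit
   DETACH DELETE a
   RETURN count(*)",
  {limit: 10000}  // batch size: an assumed value, tune for your heap size
)
```

Each batch deletes at most 10000 marked nodes along with their relationships, which keeps individual transactions small when millions of nodes are removed.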

If operations on disconnected subgraphs are going to be common, I highly encourage using a special node connected to each subgraph you create (such as a single :Cluster node at the head of the subgraph) so you can begin such operations on :Cluster nodes. This would greatly speed up these kinds of queries, since your query operations would be executed per cluster instead of per :Dead node.
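
As a rough sketch of what a per-cluster query could look like, the following goes one step beyond the suggestion above and assumes that every node in a subgraph is linked directly to its :Cluster node through a relationship type :IN_CLUSTER. That relationship type and the direct-membership modeling are hypothetical, not something established in this thread.

```cypher
// Assumption: each member node has (member)-[:IN_CLUSTER]->(c:Cluster).
// Select clusters that contain no :Alive member at all.
MATCH (c:Cluster)
WHERE NOT (c)<-[:IN_CLUSTER]-(:Alive)
SET c:ToDelete
WITH c
// Mark every member of such a cluster for deletion.
MATCH (c)<-[:IN_CLUSTER]-(n)
SET n:ToDelete
```

Under that modeling, the check is a single-hop lookup per cluster rather than a variable-length traversal per :Dead node, and the marked nodes (including the :Cluster node itself) can then be removed with the same MATCH (a:ToDelete) DETACH DELETE a step.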

InverseFalcon
  • Thank you very much. It works perfectly. In my actual use case I will have around 70 million nodes, and each day I may need to delete 4-6 million nodes, so I hope it will scale fine. The :Cluster node idea is quite interesting, but I am not sure it is feasible given that I will have new nodes coming in every day. Each time a node is added, a search would be required to know whether a :Cluster node should be added or not. – sjishan Dec 17 '16 at 00:04
  • When you add new nodes, do you know if they will be linked to existing clusters, or if they will be starting new clusters of their own? If you know this, you can include a step where you create a :Cluster node right before adding the new node. Also, are your values ever going to change? If they're going to stay the same, it may be worth adding whatever appropriate label (:Alive or :Dead, although :Dead should probably use a different name) at creation time. If they will change, you may need to remove these labels after you're done with the delete operation so you don't operate on stale data. – InverseFalcon Dec 17 '16 at 01:16
  • No, when adding new nodes they may or may not form a new cluster. That is why I use the `merge` operation for that. You can think of value as a timestamp: it will not change, and deletion is based on the timestamp. For instance, let's say I am maintaining 10 days of graph data, so each day there will be new deletions and insertions. New nodes on each day may join an old cluster or may form a new cluster by themselves. – sjishan Dec 17 '16 at 02:03