In an AWS Neptune graph with billions of nodes and edges, how can I efficiently find the largest connected components? I ask because large connected components usually indicate fraud in my domain. Most nodes in my graph are connected to only tens of other nodes, so it is suspicious when a node is connected to hundreds or thousands of others.
I have several questions:
- Is AWS Neptune an appropriate database for finding large connected components in a graph with billions of nodes and edges?
- Would it be more efficient to calculate PageRank for the graph instead? I believe a high PageRank would similarly indicate fraud. If so, how would I calculate PageRank at this scale?
- What architecture and algorithm could find the largest connected components?
- I want to identify fraud in real time as well as fraud that happened in the past. As data is ingested, what would be a good way to flag a fraudulent node in real time? I am thinking of using Neptune Streams and running a DFS from the new node to retrieve its entire connected component.
- Eventually, years from now, once I have identified enough fraud, I could try some form of supervised machine learning. I am not sure what the benefit would be, though, since most large connected components are fraud. Might it be better at identifying the harder-to-distinguish cases?
- Besides connected components and PageRank, are there other graph properties I should look into that might indicate fraud in my case? I know this may be difficult to answer since I haven't revealed my domain.
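To make the connected-components and real-time questions concrete, here is a minimal sketch of what I have in mind: a union-find (disjoint-set) structure that merges components as edges arrive and reports when a merged component reaches a size threshold. Everything here is an assumption for illustration: the names `UnionFind` and `stream_alerts`, the threshold value, and the idea that edges arrive as plain `(source, target)` pairs. It is in-memory Python and does not touch Neptune at all, so it would not scale to billions of nodes as written; it only shows the logic.

```python
class UnionFind:
    """Disjoint-set structure that tracks the size of each component."""

    def __init__(self):
        self.parent = {}  # node -> parent node (a root is its own parent)
        self.size = {}    # node -> component size (only meaningful for roots)

    def find(self, x):
        """Return the root of x's component, registering x if it is new."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the components of a and b; return the resulting size."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return self.size[ra]
        # Union by size: attach the smaller component under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return self.size[ra]


def stream_alerts(edges, threshold=100):
    """Yield (source, target, size) whenever an incoming edge leaves its
    component at or above `threshold` nodes. The threshold is hypothetical."""
    uf = UnionFind()
    for a, b in edges:
        size = uf.union(a, b)
        if size >= threshold:
            yield a, b, size


if __name__ == "__main__":
    sample = [("a", "b"), ("b", "c"), ("d", "e"), ("c", "d")]
    for alert in stream_alerts(sample, threshold=3):
        print(alert)
```

The appeal over a DFS per new node is that each incoming edge costs close to constant time and the component size is available immediately. The catch is that plain union-find only handles edge insertions, so deleted edges would need a periodic recomputation.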
Any help is much appreciated!