1

In an AWS Neptune graph with billions of nodes and edges, how would one go about finding the largest connected components efficiently? The reason I am trying to find the answer to this question is because usually large connected components in my domain indicate fraud. Most nodes in my graph only are connected to like tens of other nodes. It is suspicious when nodes are connected to hundreds or thousands of other nodes.

I have several questions:

  1. Is AWS Neptune an appropriate database for finding large connected components in a graph with billions of nodes and edges?
  2. Would it be more efficient to calculate PageRank for the graph? A high PageRank would similarly indicate fraud I believe. If so, how would I go about calculating PageRank?
  3. What architecture and algorithm could find the largest connected components?
  4. I am not just trying to find fraud that happened in the past but I am also trying to identify fraud in real time. As data is ingested, what would be a good way to identify a fraudulent node in real time? I am thinking that Neptune Streams and doing DFS on the node to get the entire connected component would be appropriate here.
  5. Eventually, years from now, when I've identified enough fraud, I am thinking I could do some sort of supervised machine learning. Not sure what the benefit of this would be though since most large connected components are fraud. It might be better at identifying harder to distinguish cases?
  6. Similarly to connected components and PageRank, are there other graph attributes I should look into that might indicate fraud in my case? I know this might be difficult to answer since I haven't revealed my domain.

Any help is much appreciated!

natsbat4ws
  • 67
  • 2
  • 8
  • Hello. Lots of questions here. I suspect a moderator may come along and close the question as it is looking for recommendations, which is something Stack Overflow tends to not allow. That said, without knowing more about the data model you have in mind and how frequently you want the queries to run it is hard to give concrete answers. It sounds as if you are talking about full graph (OLAP) analytics which is often done as a sort of batch job. If you just want to explore the graph starting from a given known start, that is a different approach that needs other considerations. – Kelvin Lawrence Jan 19 '22 at 15:36
  • Hi Kelvin, thanks for responding. I guess i just have one main question. In AWS Neptune how would one find largest connected components? This is a specific question. I am looking for queries and architecture that could do this. Any help would be much appreciated! – natsbat4ws Jan 19 '22 at 19:08
  • I'll add a few words in an answer. – Kelvin Lawrence Feb 01 '22 at 15:46

1 Answers1

1

Connected component finding queries can be expressed in Gremlin, but whether or not those queries will be efficient, is going to depend on the complexity of the graph. I would start by looking at the Gremlin Recipes document.

You will find several algorithms discussed there.

At very large scale, you may want to export data from the graph and run a Spark job (or similar) to find the fraud rings etc.

Kelvin Lawrence
  • 14,674
  • 2
  • 16
  • 38