I have a very large, weighted graph in Azure Cosmos DB. The numbers of vertices and edges are in the billions, and the database is several TB in size. I am trying to cluster the graph on Spark using a custom clustering algorithm.

I understand this can be done using Spark and GraphFrames. I can also find some old algorithms online that use GraphX and the Pregel framework, but my understanding is that it is better to implement this with GraphFrames now, for which I am not able to find any examples. I have watched several videos and read blogs, and I was able to create a small graph and play around with it using GraphFrames (using built-in APIs like LPA, BFS, etc.).
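
For context, this is roughly the kind of toy example I have been able to get working (a minimal sketch with made-up data; I run it with the graphframes package added to my Spark session):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-toy").getOrCreate()

# Tiny made-up weighted graph, just to try the built-in APIs
vertices = spark.createDataFrame(
    [("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 2.0)],
    ["src", "dst", "weight"])

g = GraphFrame(vertices, edges)

# Built-in label propagation and BFS work fine at this scale
communities = g.labelPropagation(maxIter=5)
communities.show()

paths = g.bfs(fromExpr="id = 'a'", toExpr="id = 'd'")
paths.show()
```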

My Questions:

  1. How do I implement graph clustering using GraphFrames? Is there any example of a custom graph clustering algorithm using GraphFrames that can run in a distributed fashion? Will just using GraphFrames/DataFrames and writing regular clustering code take care of distributed processing, or do I have to write it in a certain way (similar to GraphX or Pregel)? A rough sketch of what I have in mind is shown right after this list.

  2. How do I load the entire graph and run my clustering algorithm? When I load it into a GraphFrame, will it load the entire data set (several TB) into memory? Or does it automatically load only what is necessary, or should I write some custom code to load what is needed during processing? The way I was planning to load the data is sketched at the end of the question.
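
For question 1, this is the kind of pattern I imagine a custom algorithm would need, based on GraphFrames' aggregateMessages (a rough sketch only; the min-label propagation below is just a placeholder to show the message-passing loop, not the clustering algorithm I will actually use):

```python
from pyspark.sql import functions as F
from graphframes import GraphFrame
from graphframes.lib import AggregateMessages as AM

def propagate_min_label(g, max_iter=10):
    """Toy Pregel-style loop: every vertex repeatedly adopts the smallest
    label seen among itself and its neighbours. A real clustering
    algorithm would replace the message and update logic."""
    # Start every vertex off with its own id as its label
    vertices = g.vertices.withColumn("label", F.col("id"))
    graph = GraphFrame(vertices, g.edges)

    for _ in range(max_iter):
        # Send the current label across every edge in both directions
        msgs = graph.aggregateMessages(
            F.min(AM.msg).alias("min_nbr_label"),
            sendToDst=AM.src["label"],
            sendToSrc=AM.dst["label"])

        # Each vertex keeps the smaller of its own label and the best message
        new_vertices = (graph.vertices
            .join(msgs, on="id", how="left_outer")
            .withColumn("label",
                        F.when(F.col("min_nbr_label") < F.col("label"),
                               F.col("min_nbr_label"))
                         .otherwise(F.col("label")))
            .drop("min_nbr_label"))

        # Helper GraphFrames provides for caching in iterative algorithms
        new_vertices = AM.getCachedDataFrame(new_vertices)
        graph = GraphFrame(new_vertices, g.edges)

    return graph.vertices
```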

Apologies if the questions are basic; I am new to Spark, clustering, and GraphFrames.
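
For question 2, here is roughly how I was planning to load the graph (I am assuming the Azure Cosmos DB Spark 3 OLTP connector here; the endpoint, key, database, container names, and the id/src/dst/weight column names are all placeholders, since I am not sure how my containers actually map to vertex and edge columns):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cosmos-graph-load").getOrCreate()

# Placeholder connection settings for the Cosmos DB Spark 3 OLTP connector
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<my-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<my-key>",
    "spark.cosmos.database": "<my-database>",
}

# Vertices and edges come back as ordinary partitioned DataFrames
vertices = (spark.read.format("cosmos.oltp")
    .options(**cosmos_config)
    .option("spark.cosmos.container", "<vertex-container>")
    .load()
    .selectExpr("id"))

edges = (spark.read.format("cosmos.oltp")
    .options(**cosmos_config)
    .option("spark.cosmos.container", "<edge-container>")
    .load()
    .selectExpr("src", "dst", "weight"))

g = GraphFrame(vertices, edges)
```

My understanding is that these DataFrames stay lazy until an action runs, but I do not know whether GraphFrames materialises everything once I start running an algorithm over the whole graph, which is what question 2 is about.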

  • Hey, this question looks very involved; do you know what clustering algorithm you want to implement specifically? – FJ_OC Sep 01 '22 at 13:40
  • I am still figuring that out. But does that matter? I was expecting that any graph algorithm could be implemented using GraphFrames. One possible algorithm is [link](https://www.researchgate.net/publication/222648006_A_Clustering_Algorithm_Based_on_Graph_Connectivity) – 0xcoder Sep 01 '22 at 19:07

0 Answers