I'm using Spark and GraphX to build a graph that represents similar images: image names are the vertices, and there is an edge between two pictures if they have a label in common. As far as I know, GraphX partitions the data so that it can be stored on separate machines, but these partitions don't correspond to the clusters of the graph. Is there a way to create subgraphs that represent the possible clusters of a graph using GraphX, where a cluster is a set of nodes that are densely connected to each other and only sparsely connected to the rest of the graph?
Here is what I'm trying to do, step by step:
- Assign labels to each photo in a dataset, each label with a certain probability
- Compare the labels of each photo with those of every other photo and save the names of similar images as a tuple (for example, if image1 and image53 both have the label 'dog' with a probability greater than 0.5, store them as 'image1, image53')
- Build a graph using GraphX where the vertices are the image names and the edges connect those vertices that are 'similar'
- Divide this graph into clusters, i.e. extract the highly connected subgraphs of the graph, if any exist, and store each of them as an 'album'
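
To make the steps concrete, here is a minimal sketch in plain Python (no Spark or GraphX involved). The sample data and the names `labels`, `confident_labels`, `similar_pairs` and `albums` are made up for illustration. The last step uses plain connected components as a stand-in for clustering, which is an assumption and weaker than what I'm actually after: it only separates groups that share no edge at all.

```python
from itertools import combinations

# Hypothetical sample data: image name -> {label: probability}.
labels = {
    "image1": {"dog": 0.9, "grass": 0.4},
    "image53": {"dog": 0.7},
    "image7": {"dog": 0.6, "ball": 0.8},
    "image20": {"beach": 0.95},
    "image21": {"beach": 0.8, "dog": 0.2},
}

THRESHOLD = 0.5  # the cut-off used in the 'dog' example above


def confident_labels(probs, threshold=THRESHOLD):
    """Keep only the labels whose probability exceeds the threshold."""
    return {label for label, p in probs.items() if p > threshold}


def similar_pairs(labels, threshold=THRESHOLD):
    """Return (imageA, imageB) tuples for images sharing a confident label."""
    strong = {name: confident_labels(p, threshold) for name, p in labels.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(strong), 2)
        if strong[a] & strong[b]  # at least one label in common
    ]


def albums(vertices, edges):
    """Group vertices into connected components using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


edges = similar_pairs(labels)
result = albums(labels.keys(), edges)
```

With this sample data the three 'dog' images end up in one album and the two 'beach' images in another. In GraphX the equivalent of the last step would be `connectedComponents()`, but that does not split a single connected component into densely linked sub-groups, which is the part I don't know how to do.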