
I'm using Spark and GraphX to build a graph of similar images: image names are the vertices, and two images are joined by an edge if they have a label in common. As far as I know, GraphX partitions the graph so that it can be stored across separate machines, but those partitions don't correspond to clusters in the graph. Is there a way to use GraphX to extract subgraphs that correspond to clusters, where a cluster is a set of vertices that are densely connected to each other and only sparsely connected to the rest of the graph?

Here is what I'm trying to do, step by step:

  1. Assign labels to each photo in a dataset, each label with an associated probability.
  2. Compare the labels of each photo with those of every other photo and store the names of similar images as a tuple (for example, if image1 and image53 both have the label 'dog' with a probability greater than 0.5, store them as 'image1, image53').
  3. Build a graph with GraphX in which the vertices are the image names and the edges connect vertices that are 'similar'.
  4. Divide this graph into clusters, i.e. extract the highly connected subgraphs, if any exist, and store each one as an 'album'.
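
To make the steps concrete, here is a minimal sketch of steps 2 to 4 in plain Python rather than GraphX. The `predictions` data, the function names `similar_pairs` and `albums`, and the 0.5 threshold are all made up for illustration. It also uses connected components as the simplest stand-in for "clusters"; that is an assumption, since a connected component can still be loosely connected internally, and what I actually want may need a real community-detection method.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of step 1: image name -> {label: probability}
predictions = {
    "image1": {"dog": 0.9, "grass": 0.4},
    "image53": {"dog": 0.7},
    "image7": {"cat": 0.8},
    "image8": {"cat": 0.6, "sofa": 0.55},
    "image9": {"car": 0.95},
}


def similar_pairs(predictions, threshold=0.5):
    """Step 2: pairs of images that share a label above the threshold."""
    # Group images by label first, so only images sharing a label are paired.
    by_label = defaultdict(set)
    for image, labels in predictions.items():
        for label, prob in labels.items():
            if prob > threshold:
                by_label[label].add(image)
    pairs = set()
    for images in by_label.values():
        pairs.update(combinations(sorted(images), 2))
    return pairs


def albums(vertices, edges):
    """Steps 3-4: group vertices into connected components (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    # Sort for a stable, readable result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


edges = similar_pairs(predictions)
result = albums(predictions.keys(), edges)
print(result)
```

With this toy data, image1 and image53 end up in one album (shared 'dog'), image7 and image8 in another (shared 'cat'), and image9 is left on its own. Grouping by label avoids comparing every photo with every other photo directly, but it produces the same pairs as the comparison described in step 2.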
CMWasiq
  • This is more of an algorithm question than a Spark question. It seems like you want to use `GraphOps.connectedComponents()` and `GraphOps.collectNeighbors()` in some combination. But maybe if you laid out in pseudo-code what you are trying to do algorithmically it might make more sense. – David Griffin Mar 10 '16 at 20:54

0 Answers