Is there an efficient way to cluster a graph according to Jaccard similarity?

Question

Is there an efficient way to cluster nodes in a graph using Jaccard similarity such that each cluster has at least K nodes?

Jaccard similarity between nodes i and j:
Let S be the set of neighbours of i and T be the set of neighbours of j. Then the similarity between i and j is given by |(S ⋂ T)| / |(S ⋃ T)|.
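
To make the definition concrete, here is a minimal sketch in Python. The function name `jaccard` is just illustrative, and returning 0.0 when both nodes have no neighbours is an assumed convention, since the formula is undefined in that case.

```python
def jaccard(neighbours_i, neighbours_j):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two neighbour collections."""
    s, t = set(neighbours_i), set(neighbours_j)
    union = s | t
    if not union:
        # Assumed convention: two isolated nodes get similarity 0.
        return 0.0
    return len(s & t) / len(union)
```

For example, neighbour sets {1, 2, 3} and {2, 3, 4} share two of four distinct neighbours, so their similarity is 0.5.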

What format is the graph described in? As in, is it an adjacency list, adjacency matrix, etc? Is the graph reliably sparse or dense? Do you know how the degree of a node changes as the graph grows? (in particular, does it stay constant? Does it increase linearly?) — Andy Jones, Dec 20 '13 at 22:27
The graph is described as an adjacency list and should be sparse. — HHH, Dec 20 '13 at 22:33
Okay. And what kind of clusters do you want? Clusters that maximize the minimum intracluster similarity metric? Clusters that minimize the average intercluster similarity metric? Etc. Next: do you want the absolute optimum clustering? If not, how poor of an approximation will you accept? — Andy Jones, Dec 20 '13 at 23:29
Clusters that maximize the intracluster similarity. I'd prefer an optimal solution, though a good approximation algorithm is acceptable. — HHH, Dec 20 '13 at 23:40

score 1 · Answer 1 · answered Dec 21 '13 at 11:35

Have you tried implementing some algorithm yourself?

Compute all pairwise non-zero similarities (i.e. those between nodes that have at least one neighbor in common; in a sparse graph this makes the candidate set much smaller than the full n × n matrix).
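
One possible way to do this without touching every pair is sketched below: any two neighbours of the same node v share v as a common neighbour, so counting such co-occurrences gives the intersection sizes directly. The function name `nonzero_jaccard` is made up for illustration, and the sketch assumes an undirected graph given as a dict in which every node appears as a key and node labels are sortable.

```python
from collections import defaultdict
from itertools import combinations

def nonzero_jaccard(adj):
    """Return {(a, b): similarity} for all pairs with a common neighbour.

    adj: dict mapping each node to an iterable of its neighbours
    (assumed symmetric, i.e. an undirected graph).
    """
    nbrs = {u: set(vs) for u, vs in adj.items()}
    common = defaultdict(int)  # (a, b) -> number of shared neighbours
    for vs in nbrs.values():
        # Every pair of neighbours of this node shares it as a neighbour.
        for a, b in combinations(sorted(vs), 2):
            common[(a, b)] += 1
    # |S ∪ T| = |S| + |T| - |S ∩ T|
    return {
        (a, b): c / (len(nbrs[a]) + len(nbrs[b]) - c)
        for (a, b), c in common.items()
    }
```

The work per node is quadratic in its degree, so this is cheap on a sparse graph but can still blow up if a few nodes have very high degree.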

Sort the pairs and process them in order of decreasing similarity. Initially, each node is its own cluster.

For each pair (A, B): if A and B are not yet in the same cluster, and either cluster has fewer than k members, merge the two clusters. Repeat until all pairs have been processed.
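
The merge procedure can be sketched with a union-find (disjoint-set) structure, which is one reasonable choice for tracking clusters and their sizes, not the only one. The name `cluster_min_size` and the input format (a dict of pair to similarity) are assumptions for illustration; this is a greedy heuristic, not an optimal clustering.

```python
def cluster_min_size(nodes, sims, k):
    """Greedily merge clusters in decreasing similarity order.

    nodes: iterable of node labels
    sims:  dict {(a, b): similarity} of non-zero pairwise similarities
    k:     desired minimum cluster size
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}
    size = {n: 1 for n in nodes}

    def find(x):
        # Find the cluster representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _ in pairs:
        ra, rb = find(a), find(b)
        # Merge only if the clusters differ and at least one is still too small.
        if ra != rb and (size[ra] < k or size[rb] < k):
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())
```

Sorting dominates the cost, so the whole step is roughly O(m log m) for m non-zero pairs. Nodes that never appear in any pair stay in singleton clusters.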

Note that you may still end up with clusters of fewer than k members, for example if your data set has fewer than k nodes in total, or if it contains small subgraphs that are not connected to the rest.

You really should accept clusters of fewer than k nodes, i.e. unclustered nodes. Why would everything cluster? Real data always contains outliers and noise.
