I'm trying to implement the Rank-Order Clustering here is a link to the paper (which is a kind of agglomerative clustering) algorithm from scratch. I have read through the paper (many times) and I have an implementation that is working although it is a lot slower than I expect.
Here is a link to my Github which has instructions to download and run the Jupyter Notebook.
The algorithm:
Algorithm 1 Rank-Order distance based clustering
Input:
N faces, Rank-Order distance threshold t .
Output:
A cluster set C and an “un-grouped” cluster Cun.
1: Initialize clusters C = { C1, C2, … CN }
by letting each face be a single-element cluster.
2: repeat
3: for all pair Cj and Ci in C do
4: Compute distances DR( Ci, Cj) by (4) and DN(Ci, Cj) by (5).
5: if DR(Ci, Cj) < t and DN(Ci, Cj) < 1 then
6: Denote ⟨Ci, Cj⟩ as a candidate merging pair.
7: end if
8: end for
9: Do “transitive” merge on all candidate merging pairs.
(For example, Ci, Cj, Ck are merged
if ⟨Ci, Cj⟩ and ⟨Cj, Ck⟩ are candidate merging pairs.)
10: Update C and absolute distances between clusters by (3).
11: until No merge happens
12: Move all single-element clusters in C into an “un-grouped” face cluster Cun.
13: return C and Cun.
My implementation:
I have defined a Cluster
class like so:
class Cluster:
def __init__(self):
self.faces = list()
self.absolute_distance_neighbours = None
A Face
class like so:
class Face:
def __init__(self, embedding):
self.embedding = embedding # a point in 128 dimensional space
self.absolute_distance_neighbours = None
I have also implemented finding the rank-order distance (D^R(C_i, C_j))
and the normalized distance (D^N(C_i, C_j))
used in step 4
so these can be taken for granted.
Here is my implementation for finding the closest absolute distance between two clusters:
def find_nearest_distance_between_clusters(cluster1, cluster2):
nearest_distance = sys.float_info.max
for face1 in cluster1.faces:
for face2 in cluster2.faces:
distance = np.linalg.norm(face1.embedding - face2.embedding, ord = 1)
if distance < nearest_distance:
nearest_distance = distance
# If there is a distance of 0 then there is no need to continue
if distance == 0:
return(0)
return(nearest_distance)
def assign_absolute_distance_neighbours_for_clusters(clusters, N = 20):
for i, cluster1 in enumerate(clusters):
nearest_neighbours = []
for j, cluster2 in enumerate(clusters):
distance = find_nearest_distance_between_clusters(cluster1, cluster2)
neighbour = Neighbour(cluster2, distance)
nearest_neighbours.append(neighbour)
nearest_neighbours.sort(key = lambda x: x.distance)
# take only the top N neighbours
cluster1.absolute_distance_neighbours = nearest_neighbours[0:N]
Here is my implementation of the rank-order clustering algorithm (assume that the implementation of find_normalized_distance_between_clusters
and find_rank_order_distance_between_clusters
is correct):
import networkx as nx
def find_clusters(faces):
clusters = initial_cluster_creation(faces) # makes each face a cluster
assign_absolute_distance_neighbours_for_clusters(clusters)
t = 14 # threshold number for rank-order clustering
prev_cluster_number = len(clusters)
num_created_clusters = prev_cluster_number
is_initialized = False
while (not is_initialized) or (num_created_clusters):
print("Number of clusters in this iteration: {}".format(len(clusters)))
G = nx.Graph()
for cluster in clusters:
G.add_node(cluster)
processed_pairs = 0
# Find the candidate merging pairs
for i, cluster1 in enumerate(clusters):
# Only get the top 20 nearest neighbours for each cluster
for j, cluster2 in enumerate([neighbour.entity for neighbour in \
cluster1.absolute_distance_neighbours]):
processed_pairs += 1
print("Processed {}/{} pairs".format(processed_pairs, len(clusters) * 20), end="\r")
# No need to merge with yourself
if cluster1 is cluster2:
continue
else:
normalized_distance = find_normalized_distance_between_clusters(cluster1, cluster2)
if (normalized_distance >= 1):
continue
rank_order_distance = find_rank_order_distance_between_clusters(cluster1, cluster2)
if (rank_order_distance >= t):
continue
G.add_edge(cluster1, cluster2) # add an edge to denote that these two clusters are to be merged
# Create the new clusters
clusters = []
# Note here that nx.connected_components(G) are
# the clusters that are connected
for _clusters in nx.connected_components(G):
new_cluster = Cluster()
for cluster in _clusters:
for face in cluster.faces:
new_cluster.faces.append(face)
clusters.append(new_cluster)
current_cluster_number = len(clusters)
num_created_clusters = prev_cluster_number - current_cluster_number
prev_cluster_number = current_cluster_number
# Recalculate the distance between clusters (this is what is taking a long time)
assign_absolute_distance_neighbours_for_clusters(clusters)
is_initialized = True
# Now that the clusters have been created, separate them into clusters that have one face
# and clusters that have more than one face
unmatched_clusters = []
matched_clusters = []
for cluster in clusters:
if len(cluster.faces) == 1:
unmatched_clusters.append(cluster)
else:
matched_clusters.append(cluster)
matched_clusters.sort(key = lambda x: len(x.faces), reverse = True)
return(matched_clusters, unmatched_clusters)
The problem:
The reason for the slow performance is due to step 10: Update C and absolute distance between clusters by (3)
where (3)
is:
This is the smallest L1-norm distance between all the faces in C_i (cluster i)
and C_j (cluster j)
After merging the clusters
Since I have to recalculate the absolute distances between the newly created clusters every time I finish finding the candidate merging pairs in steps 3 - 8
. I'm basically having to do a nested for loop for all the created cluster and then having ANOTHER nested for loop to find the two faces that have the closest distance. Afterwards, I still have to sort the neighbours by nearest distance!
I believe that this is the wrong approach as I have looked at OpenBR which has also implemented the same Rank-Order Clustering algorithm that I want it is under the method name:
Clusters br::ClusterGraph(Neighborhood neighborhood, float aggressiveness, const QString &csv)
Although I'm not that familiar with C++ I'm pretty sure that they are not recalculating the absolute distances between the clusters which leads me to believe that this is the part of the algorithm that I am implementing wrongly.
Moreover, at the top of their method declaration the comments say that they have pre-computed a kNN graph which makes sense as when I recalculate the absolute distances between clusters I am doing a lot of computation that I have previously done. I believe that the key is to precompute a kNN graph for the clusters although this is the part that I'm stuck at. I can't think of how to implement the data structure so that the absolute distances of the clusters would not have to be recalculated every time they are merged.