I work on hierarchical agglomerative clustering of large sets of multidimensional vectors, and I've noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation of this task looks like the following (here in Python):
    import numpy as np
    from scipy.spatial.distance import cosine

    def create_dist_matrix(v):
        """v is an (N, d) array: rows are observations, columns are dimensions."""
        N = v.shape[0]
        D = np.zeros((N, N))
        for i in range(N):
            for j in range(i + 1):  # lower triangle only; the matrix is symmetric
                D[i, j] = cosine(v[i, :], v[j, :])
        return D
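For reference, the same matrix can be computed without the Python-level double loop using `scipy.spatial.distance.pdist`, which runs the metric in C over the condensed upper triangle (this is still single-threaded, but it is the baseline any parallel version should beat):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
v = rng.random((100, 16))  # 100 observations, 16 dimensions

# pdist returns the condensed vector of N*(N-1)/2 pairwise distances;
# squareform expands it to the full symmetric (N, N) matrix.
D = squareform(pdist(v, metric='cosine'))
```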
I was wondering what the best way to add some parallelism to this routine would be. An easy option is to split the outer loop across a number of jobs, e.g. with 10 processors, spawn 10 jobs covering different ranges of i, and then concatenate the results. However, this "horizontal" solution doesn't seem quite right: since the inner loop runs i+1 iterations, later row ranges cost far more than earlier ones, so equal-sized chunks are badly load-balanced. Are there other parallel algorithms (or existing libraries) for this task? Any help would be highly appreciated.
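For concreteness, the row-splitting idea I have in mind would look roughly like this (a sketch using `multiprocessing.Pool`; the chunking scheme and worker count are arbitrary choices, and the function names are mine):

```python
import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cosine

def _row_block(args):
    """Compute lower-triangular rows i_start..i_end-1 of the distance matrix."""
    v, i_start, i_end = args
    block = np.zeros((i_end - i_start, v.shape[0]))
    for i in range(i_start, i_end):
        for j in range(i + 1):
            block[i - i_start, j] = cosine(v[i, :], v[j, :])
    return block

def parallel_dist_matrix(v, n_jobs=4):
    N = v.shape[0]
    # Split the outer-loop range into n_jobs contiguous row chunks.
    bounds = np.linspace(0, N, n_jobs + 1, dtype=int)
    chunks = [(v, bounds[k], bounds[k + 1]) for k in range(n_jobs)]
    with Pool(n_jobs) as pool:
        blocks = pool.map(_row_block, chunks)
    # Stack the row blocks back into the full (N, N) lower-triangular matrix.
    return np.vstack(blocks)
```

Note that this sketch inherits the load imbalance I mentioned: the last chunk does far more work than the first, so the speedup is well below n_jobs unless the chunks are sized to equalize the triangular workload.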