
After clustering a distance matrix with scipy.cluster.hierarchy.linkage and assigning each sample to a cluster with scipy.cluster.hierarchy.cut_tree, I would like to extract from each cluster the one element that is closest to that cluster's centroid. A minimal sketch of my setup follows the notes below.

  • I would be happiest if an off-the-shelf function existed for this, but in the absence of one:
  • some suggestions were already proposed here for extracting the centroids themselves, but not the closest-to-centroid elements.
  • Note that this is not to be confused with the centroid linkage rule in scipy.cluster.hierarchy.linkage. I have already carried out the clustering itself; I just want to access the closest-to-centroid elements.
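
For concreteness, here is a minimal sketch of that setup (random data, an arbitrary linkage rule and cluster count, purely for illustration):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cut_tree

data = np.random.rand(50, 4)                   # 50 samples in 4 dimensions
dm = pdist(data)                               # condensed distance matrix
Z = linkage(dm, method="average")              # hierarchical clustering
labels = cut_tree(Z, n_clusters=5).flatten()   # flat cluster assignments

# Goal: for each label 0..4, find the sample closest to that cluster's centroid.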
deefff

2 Answers


Nearest neighbours are most efficiently computed using KD-Trees. E.g.:

from scipy.spatial import cKDTree

def find_k_closest(centroids, data, k=1, distance_norm=2):
    """
    Arguments:
    ----------
        centroids: (M, d) ndarray
            M - number of clusters
            d - number of data dimensions
        data: (N, d) ndarray
            N - number of data points
        k: int (default 1)
            which nearest neighbour to return (k=1 is the closest)
        distance_norm: int (default 2)
            1: Manhattan distance (|x| + |y|)
            2: Euclidean distance (sqrt(x^2 + y^2))
            np.inf: maximum distance in any dimension (max(|x|, |y|))

    Returns:
    -------
        indices: (M,) ndarray
        values: (M, d) ndarray
    """

    kdtree = cKDTree(data)
    distances, indices = kdtree.query(centroids, k, p=distance_norm)
    if k > 1:
        indices = indices[:,-1]
    values = data[indices]
    return indices, values

indices, values = find_k_closest(centroids, data)
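
The centroids themselves are not returned by cut_tree, so they have to be computed from the cluster labels first. A minimal sketch, assuming data is the (N, d) coordinate array and labels is the flat assignment returned by cut_tree (both names are illustrative):

import numpy as np

# Mean of each cluster's members, i.e. the cluster centroids
centroids = np.vstack([data[labels == c].mean(axis=0)
                       for c in np.unique(labels)])

indices, values = find_k_closest(centroids, data)

Note that cKDTree.query searches over all of data, so in principle a centroid's nearest neighbour can belong to a different cluster; if that matters, restrict data to the cluster's own members before querying.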
Paul Brodersen
  • Thanks for your suggestion. I've looked into it but realized that it is not suited for my problem, because I have a "flat" distance matrix (as returned by `scipy.spatial.distance.pdist`), where the distances are calculated in a non-trivial way between custom-class objects. As such, I realize that centroids are not even applicable here (lack of proper Cartesian coordinates), so I'll just go with selecting that object with the least average distance to the other objects of the cluster. I'll post my final solution as soon as it's ready. – deefff Oct 03 '16 at 14:08
  • If you are using hierarchical clustering, then you have to have defined a distance measure between points. Just saying. – Paul Brodersen Oct 03 '16 at 14:12
  • Yes, but it's somewhat special. Specifically, I have 3D objects (2-4 points in 3D) and I calculate RMS distances between them, but first I rotate and translate them to get optimal alignment (note that this implies a different rotational/translational state for each pair of objects). My distance matrix contains these RMS distances. `kdtree.query` seems pretty much hardcoded with Minkowski-type norms, I don't know how I could implement this "fitting-and-measuring" routine in place of the `distance_norm`. But after all, I might not even need to. – deefff Oct 03 '16 at 15:15
  • You need to include such information in your question! Anyway, hope you figure something out. All the best. – Paul Brodersen Oct 04 '16 at 09:36
  • Sorry, I didn't realize that, but I guess it's logical that people will think of multidimensional arrays first. Nonetheless, I tried out your code, and the line `indices = idx[:,-1]` raises `SystemError: error return without exception set` if k=1. I would propose the following: if k==1: indices=idx[:] else: indices = idx[:,-1] – deefff Oct 04 '16 at 11:58
  • Good catch. Edited. – Paul Brodersen Oct 04 '16 at 14:50

Paul's solution above works well for multidimensional arrays. In the more specific case where you have a distance matrix dm in which the distances are calculated in a "non-trivial" way (e.g. each pair of objects is first aligned in 3D, then the RMSD is calculated), I ended up selecting from each cluster the element whose sum of distances to the other elements in the cluster is the lowest, i.e. the cluster's medoid. (See the discussion below this answer.) This is how I did it, given the distance matrix dm, a list names of object names in identical order, and a cut height rmsd_cutoff:

import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch

# Square form of the condensed distance matrix
sq = spd.squareform(dm)
# Perform the clustering, capture the linkage object
clusters = sch.linkage(dm, method="average")  # or whichever linkage rule you use
# Flat array of cluster assignments, one label per object
assignments = sch.cut_tree(clusters, height=rmsd_cutoff).flatten()
# Store object names and assignments as a list of (name, label) tuples
nameList = list(zip(names, assignments))

### Extract the element closest to each cluster's centroid (its medoid)
num_clusters = assignments.max() + 1
medoids = []
for counter in range(num_clusters):

    # Boolean mask selecting this cluster's members
    mask = (assignments == counter)

    # Column with the smallest sum of distances in the cluster's submatrix
    idx = np.argmin(sq[mask][:, mask].sum(axis=0))

    # Names of this cluster's elements, in matrix order
    sublist = [name for (name, cluster) in nameList if cluster == counter]

    # Element closest to the cluster centroid (the medoid)
    medoids.append(sublist[idx])
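
If the medoids' row indices in the original matrix are needed as well (not just the names), the same selection can be written with index arrays; a short sketch under the same assumptions:

medoid_indices = []
for c in range(num_clusters):
    members = np.where(assignments == c)[0]       # row indices of this cluster
    sub = sq[np.ix_(members, members)]            # within-cluster submatrix
    medoid_indices.append(members[sub.sum(axis=0).argmin()])

# medoid_indices[c] is the row/column in sq of cluster c's medoid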
deefff
  • "I ended up selecting from each cluster the element whose sum of distances to the other elements in the cluster is the lowest" – is this the same as finding the **medoid** of a cluster? – Blue482 Dec 13 '16 at 22:19