
I have the following distance matrix based on 10 datapoints:

import numpy as np

distance_matrix = np.array([[0.        , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
                             0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
                            [0.00981376, 0.        , 0.06148157, 0.00563764, 0.04473798,
                             0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
                            [0.0698306 , 0.06148157, 0.        , 0.05693448, 0.02083512,
                             0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
                            [0.01313118, 0.00563764, 0.05693448, 0.        , 0.0408836 ,
                             0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
                            [0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0.        ,
                             0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
                            [0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
                             0.        , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
                            [0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
                             0.01284173, 0.        , 0.12310601, 0.01916205, 0.05188323],
                            [0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
                             0.13540522, 0.12310601, 0.        , 0.11271352, 0.07346808],
                            [0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
                             0.03010677, 0.01916205, 0.11271352, 0.        , 0.04157886],
                            [0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
                             0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0.        ]])

I transform the distance_matrix into an affinity_matrix using the following

delta = 0.1
affinity_matrix = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))

Which gives

affinity_matrix = np.array([[1.        , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
                             0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
                            [0.99519608, 1.        , 0.82778719, 0.99841211, 0.90477015,
                             0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
                            [0.7836321 , 0.82778719, 1.        , 0.85037594, 0.97852875,
                             0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
                            [0.99141566, 0.99841211, 0.85037594, 1.        , 0.91982353,
                             0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
                            [0.86691389, 0.90477015, 0.97852875, 0.91982353, 1.        ,
                             0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
                            [0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
                             1.        , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
                            [0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
                             0.9917884 , 1.        , 0.46871776, 0.9818083 , 0.87407117],
                            [0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
                             0.39982486, 0.46871776, 1.        , 0.52982057, 0.76347268],
                            [0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
                             0.95569077, 0.9818083 , 0.52982057, 1.        , 0.91719051],
                            [0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
                             0.81114925, 0.87407117, 0.76347268, 0.91719051, 1.        ]])
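A quick sanity check that the transform and the matrix above agree:

np.allclose(affinity_matrix,
            np.exp(- distance_matrix ** 2 / (2. * delta ** 2)))  # True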

I plot the distance_matrix as a heatmap to get a better visual of the data

import pandas as pd
import seaborn as sns

distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)

[heatmap of the distance matrix]
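Since the clustering below runs on the affinity_matrix rather than on the raw distances, it can also help to plot the same kind of heatmap for the affinities, e.g.:

affinity_matrix_df = pd.DataFrame(affinity_matrix,
                                  index=range(1, 11), columns=range(1, 11))
sns.heatmap(affinity_matrix_df, cmap='RdYlGn', annot=True, linewidths=0.5)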

Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to anticipate the clusters. Clearly #8 is an outlier and will be a cluster on its own.

Next I run the actual clustering.

from sklearn.cluster import SpectralClustering    
clustering = SpectralClustering(n_clusters=3,
                            assign_labels='kmeans',
                            affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1 

The output is

[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
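Note that assign_labels='kmeans' involves a random initialization, so the exact label numbering can vary between runs; for a reproducible run one can fix a seed, e.g.

# same call as above, with a fixed seed so repeated runs give identical labels
clustering = SpectralClustering(n_clusters=3,
                                assign_labels='kmeans',
                                affinity='precomputed',
                                random_state=0).fit(affinity_matrix)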

So, #8 is part of cluster 2, which contains three other data points. Initially, I assumed it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.

  • `Clustering` is unsupervised learning, so it might do what you expect, or it might not. I think you should try other numbers of clusters (why did you choose 3?); it might help to take a look [here](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). There are many methods for this, some visual and others metric-based, but you should definitely look at them before assuming a best number of clusters. – Felipe Whitaker Oct 24 '20 at 13:59
  • Thanks for the help. The number of clusters will always be 2 or 3 for this particular project. I get that there will be surprises to some extent, but in the distance matrix #8 looks like it is on its own planet, right? – HJA24 Oct 24 '20 at 14:10
  • They do look like they might be different from the rest, but the distances are all quite small (they go from 0 to 0.14), so it might be a scale problem. I wouldn't worry too much if they fell into a cluster with some other data point. Honestly, I would check their fit to the training data, and maybe try other clustering methods. They are all unsupervised, so you should always check whether they did what you expect (and if they did something you didn't expect, what can you learn from it?). – Felipe Whitaker Oct 24 '20 at 20:34
  • Okay, but why is #9 a cluster on its own? It doesn't make sense intuitively. For the scaling I used the heuristic from Ng, Jordan and Weiss, where you loop over a range of scaling factors and pick the value that, after clustering, gives the tightest (smallest-distortion) clusters (see the sketch after these comments). – HJA24 Oct 24 '20 at 20:43
  • That is, indeed, a good question. Maybe you could plot a two-component PCA transformation of the data for visualization to try to figure it out. Sorry if I wasn't too helpful. – Felipe Whitaker Oct 25 '20 at 04:04
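For reference, a minimal sketch of the delta-selection heuristic mentioned in the comment above (the candidate range and the distortion measure below are illustrative choices, not fixed by the method): loop over candidate scaling factors, cluster, and keep the delta whose clusters come out tightest.

import numpy as np
from sklearn.cluster import SpectralClustering

def distortion(X, labels):
    # within-cluster sum of squared distances to each cluster mean
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

best_delta, best_score = None, np.inf
for delta in np.linspace(0.02, 0.2, 10):
    affinity = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
    labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                                assign_labels='kmeans',
                                random_state=0).fit_predict(affinity)
    score = distortion(affinity, labels)
    if score < best_score:
        best_delta, best_score = delta, score

print(best_delta)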

1 Answer


When we move away from relatively simple clustering algorithms like k-means, whatever intuition we may carry along regarding algorithm results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:

Apply clustering to a projection of the normalized Laplacian.

In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster. For instance when clusters are nested circles on the 2D plane.

Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes clear enough that we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
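To get a rough feel for what that projection looks like, here is a minimal hand-rolled sketch in the spirit of Ng, Jordan & Weiss (a simplification for illustration, not scikit-learn's exact internals); inspecting the rows of embedding shows which samples end up close together in the space that actually gets clustered:

import numpy as np
from sklearn.cluster import KMeans

A = affinity_matrix                                  # the precomputed affinities
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt     # normalized Laplacian

# eigenvectors of the 3 smallest eigenvalues form the spectral embedding
eigvals, eigvecs = np.linalg.eigh(L_sym)                 # eigenvalues ascending
embedding = eigvecs[:, :3]
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)  # row-normalize

# k-means in the embedded space, as in the Ng-Jordan-Weiss formulation
print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding) + 1)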

Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get

from sklearn.cluster import KMeans    
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1 
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)

where indeed sample #8 stands out as an outlier in a cluster of its own (#3).

Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value lies exactly in the fact that they can uncover regularities of different kinds in the data; arguably they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
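For completeness, if what you want is behavior that tracks the raw distances more closely, one simple alternative worth trying is hierarchical clustering directly on the precomputed distance matrix (a sketch only; newer scikit-learn versions call the parameter metric instead of affinity):

from sklearn.cluster import AgglomerativeClustering

# average-linkage hierarchical clustering on the precomputed distances
# (on newer scikit-learn versions pass metric='precomputed' instead of affinity='precomputed')
agg = AgglomerativeClustering(n_clusters=3, affinity='precomputed',
                              linkage='average').fit(distance_matrix)
print(agg.labels_ + 1)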

The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets; here is the summary finding:

[summary plot from the scikit-learn vignette, comparing clustering algorithms on the toy datasets]
