
I have the following sklearn clusters obtained using affinity propagation.

import sklearn.cluster
import numpy as np

# Precomputed pairwise similarity matrix (higher = more similar).
sims = np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

# Cluster directly on the similarity matrix.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
labels = affprop.labels_
# number of clusters
n_clusters_ = len(cluster_centers_indices)

Now I want to plot the output of the clusters. I am new to sklearn, so please suggest a suitable approach to plot the clusters in Python. Is it possible to do this with pandas DataFrames?
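To clarify what I mean by pandas, something like the sketch below, where the cluster labels are attached to a DataFrame (the 2-D coordinates for plotting are the part I don't know how to get):

import pandas as pd

# Attach each item's cluster label to a DataFrame for inspection/grouping.
df = pd.DataFrame({'item': range(len(labels)), 'cluster': labels})
print(df.groupby('cluster')['item'].apply(list))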

EDIT:

I used the code from the sklearn example directly, as pointed out by @MohammedKashif:

import sklearn.cluster
import numpy as np

sims = np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = sims[cluster_centers_indices[k]]
    # Plot the first two columns of sims as if they were x/y coordinates.
    plt.plot(sims[class_members, 0], sims[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in sims[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

However, the output I get is a bit weird, as shown below: the second cluster point (green) lies on the blue line, so I don't think it should be a separate cluster; it should be in the blue cluster too. Please let me know if I have made any mistakes in the code.

[plot: the clusters as plotted by the code above, with the green cluster centre lying on a blue line]

EDIT 2:

As pointed out by σηγ, I added:

from sklearn.manifold import SpectralEmbedding

se = SpectralEmbedding(n_components=2, affinity='precomputed')
X = se.fit_transform(sims)
print(X)

However, for the array `np.array([[0, 17, 10, 32, 32], [0, 17, 10, 32, 32], [0, 17, 10, 32, 33], [0, 17, 10, 32, 32], [0, 17, 10, 32, 32]])` it gave me 3 points, as shown below. That confuses me, because all five rows represent the same point.

[plot: three separate points instead of the expected single point]

Please help me.

  • You can see the example here for more reference: http://scikit-learn.org/stable/auto_examples/cluster/plot_affinity_propagation.html#sphx-glr-auto-examples-cluster-plot-affinity-propagation-py – Gambit1614 Sep 15 '17 at 06:03
  • @MohammedKashif Thank you for your comment. Can we directly change the `X` in the code to `sims`? The output graph I get is not what I expected. – Sep 15 '17 at 06:13
  • Yes, you will have to change the variable names accordingly. – Gambit1614 Sep 15 '17 at 06:14
  • @MohammedKashif Can you please look at my answer and let me know if it is wrong :) – Sep 15 '17 at 06:39
  • That looks mostly as expected, I would say - you only have 5 data points, 2 of which are the cluster centres, and the other 3 are assigned to the top left/blue cluster. So that graph is probably what I would expect. What are you expecting to see? – Ken Syme Sep 15 '17 at 07:01
  • @KenSyme The second cluster point (green) is on the blue line, so I don't think it should be a separate cluster; it should be in the blue cluster too. What do you think? – Sep 15 '17 at 07:13
  • @Volka it looks to me like the line is just passing under that point, not that the point is on it. You have clustered based on 5 "features" but are only plotting the first 2, so you are not seeing the full picture of why it has clustered this way. Try plotting other combinations to see the different clusters, or investigate things like PCA or t-SNE to map your 5 features into 2 for plotting (see the sketch after these comments). – Ken Syme Sep 15 '17 at 08:56
  • @Volka `sims` looks like a similarity matrix, not a feature or coordinate array. If you want to visualize the data based on the similarities, you should choose a method that works directly on the similarity matrix (e.g. [SpectralEmbedding](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html#sklearn.manifold.SpectralEmbedding) in sklearn). – σηγ Sep 15 '17 at 17:56
  • @σηγ I didn't get you, as I am very new to this area. Could you please elaborate on your idea? :) – Sep 16 '17 at 09:07
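A minimal sketch of the PCA route Ken Syme mentions above, assuming the `sims` and `labels` variables from the question. As σηγ notes, this treats the rows of the similarity matrix as if they were feature vectors, so it is only a rough visualization, not a faithful embedding:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the 5-dimensional rows of sims down to 2 components for plotting.
coords = PCA(n_components=2).fit_transform(sims)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title('PCA projection of the rows of sims')
plt.show()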

1 Answer


Following the previous example, I would try something like this:

import sklearn.cluster
from sklearn.manifold import SpectralEmbedding
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

sims = np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

# Embed the similarity matrix into 2-D coordinates for plotting.
se = SpectralEmbedding(n_components=2, affinity='precomputed')
X = se.fit_transform(sims)

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

[plot: the two clusters shown in the 2-D spectral embedding space]

σηγ
  • Interesting! What exactly happens with SpectralEmbedding? – Sep 16 '17 at 10:42
  • Spectral embedding (aka Laplacian eigenmaps) attempts to find a low-dimensional representation of a high-dimensional data set so that the local distance between points in the low-dimensional representation approximates their distance (or similarity) in the high-dimensional space (cf. [Wikipedia](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Laplacian_eigenmaps)). – σηγ Sep 16 '17 at 13:08
  • Actually, many of the manifold learning methods in `sklearn.manifold` aim to do the same thing but with different algorithms. However, most of them require a set of feature vectors or a distance matrix to work with. – σηγ Sep 16 '17 at 13:08
  • @σηγ Many thanks for your wonderful answer. I tried your code with `np.array([[0, 17, 10, 32, 32], [0, 17, 10, 32, 32], [0, 17, 10, 32, 33], [0, 17, 10, 32, 32], [0, 17, 10, 32, 32]])`. Even though the five arrays represent the same point, it shows 3 different points. Do you know why that happens? – Sep 16 '17 at 13:27
  • I think SpectralEmbedding does not handle cases where the points are overlapping well. Anyway, that new array does not look like a similarity matrix (why are the diagonal elements, which should correspond to self-similarities, not equal if the array describes the same points?). If those are actually feature vectors, you could replace the SpectralEmbedding part with another projection, e.g. `X = sklearn.manifold.MDS(n_components=2).fit_transform(new_array)`. The result should be a plot with just one data point (see the sketch below). – σηγ Sep 17 '17 at 13:32
  • @σηγ Thank you very much! Please let me know if you know an answer for this: https://stackoverflow.com/questions/46265803/plot-specific-points-in-dbscan-in-sklearn-in-python –  Sep 18 '17 at 00:45
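A self-contained sketch of the MDS alternative suggested in the comment above, using the degenerate array from EDIT 2 (the name `new_array` is just for illustration):

import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Treat the five (nearly identical) rows as feature vectors; MDS places
# them in 2-D so that identical rows land on the same point.
new_array = np.array([[0, 17, 10, 32, 32], [0, 17, 10, 32, 32],
                      [0, 17, 10, 32, 33], [0, 17, 10, 32, 32],
                      [0, 17, 10, 32, 32]])
X = MDS(n_components=2).fit_transform(new_array)
plt.scatter(X[:, 0], X[:, 1])
plt.title('MDS projection of new_array')
plt.show()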