from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import numpy as np
import matplotlib.pyplot as plt

# data
np.random.seed(4711)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)

plt.scatter(X[:,0], X[:,1])
plt.show()

[figure: scatter plot of the two training clusters]

# fit clusters
Z = linkage(X, method='ward', metric='euclidean', preserve_input=True)

# plot dendrogram
plt.figure()
dendrogram(Z)
plt.show()

[figure: dendrogram of the Ward linkage]

max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')

# now if I have new data
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X_test = np.concatenate((a, b),)
print(X_test.shape)  # 15 samples with 2 dimensions
plt.scatter(X_test[:,0], X_test[:,1])
plt.show()

[figure: scatter plot of the new data]

How can I compute distances for the new data and assign it to the clusters found on the training data?

Code reference: joernhees.de


1 Answer


You don't.

Clustering does not have training and test stages. It is an exploratory approach. You explore your data, and you can explore new data by rerunning the algorithm. But by the very nature of the algorithm, you cannot meaningfully "assign" new data to the old structure: the new data could completely change the discovered structure.
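For illustration, "rerunning" on pooled data could look like the following minimal sketch (it reuses X, X_test, and max_d from the question; note that cluster IDs from a fresh run are arbitrary and need not match the original run's IDs):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# pool old and new points and re-fit the hierarchy from scratch
X_all = np.vstack([X, X_test])
Z_all = linkage(X_all, method='ward', metric='euclidean')
labels_all = fcluster(Z_all, max_d, criterion='distance')
labels_new = labels_all[len(X):]  # labels of the new points only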

If you want classification, use a classifier.

Clustering algorithms are not substitutes for classifiers. If you want to classify new instances, use a classifier, e.g. with this workflow (sketched in code after the list):

  1. Explore the data with clustering (many times)
  2. Label the training data with the clusters your domain expert deems meaningful (validate the clustering!)
  3. Train a classifier
  4. Use the classifier to label new instances the same way
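
A minimal sketch of steps 2-4, assuming scikit-learn is available (the choice of KNeighborsClassifier is mine for illustration; any classifier would do):

from sklearn.neighbors import KNeighborsClassifier

# step 2: treat the (expert-validated) cluster labels as class labels
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, clusters)  # X and clusters come from the question's code

# step 4: label new instances the same way
labels_new = clf.predict(X_test)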

There are, of course, some exceptions. With k-means and Ward (but not, e.g., with single-link), a nearest-centroid classifier can apply the discovered model to new data in a somewhat meaningful way. Still, this means "converting" the clustering into a static classifier, and the result may no longer be a local optimum on the full data set (see also: concept drift).
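
For the Ward result above, such a nearest-centroid conversion could look like this minimal sketch (my illustration, reusing X, X_test, and clusters from the question):

import numpy as np
from scipy.spatial.distance import cdist

# one centroid per discovered cluster (Ward minimizes within-cluster
# variance, so centroids are a reasonable summary of its clusters)
cluster_ids = np.unique(clusters)
centroids = np.array([X[clusters == k].mean(axis=0) for k in cluster_ids])

# assign each new point to the nearest centroid (Euclidean distance)
labels_new = cluster_ids[cdist(X_test, centroids).argmin(axis=1)]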

  • If I want to apply a different set of models to each cluster, would it be meaningful to use a distance metric (say, Euclidean) to assign clusters to new data, and then use the fitted models for prediction? – muon Dec 29 '15 at 14:05
  • What about outliers? The example you had above is too simplistic. – Has QUIT--Anony-Mousse Dec 29 '15 at 18:17
  • 3
    This isn't defended very well. You should provide some evidence that assigning new data invalidates the original model. There is nothing in classification that would add a stronger guarantee. New data can change the assumed distributions of any supervised model as well. You shouldn't say "You Don't" ... that's too simplistic. There are libraries that allow you to predict the cluster assignment of new data and they exist for a reason. – Cybernetic Jan 02 '18 at 17:45
  • It works for k-means, but there are many, many more algorithms that cannot predict, for many reasons. Consider DBSCAN, Affinity Propagation, or Spectral Clustering. – Has QUIT--Anony-Mousse Jan 03 '18 at 01:35
  • 1
    The whole purpose of doing clustering is to find new features that can be identified for supervised learning. I think the first paragraph should be corrected – webjockey Dec 05 '18 at 06:15
  • No, the main purpose of clustering is for a *human* to understand the data better. It is an EDA technique. The results usually are not quite good enough on real data to use them as features. – Has QUIT--Anony-Mousse Dec 05 '18 at 07:48
  • 2
    please don't presume to know what we're using the clustering for. perhaps we're trying to do something creative you haven't considered yet. in the meanwhile, the question isn't answered. – user108569 May 24 '20 at 08:18