
I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to cluster my data set.

I could use the function score() to compute the log probability under the model.

However, I am looking for a metric called 'purity', which is defined in this article.

How can I implement it in Python? My current implementation looks like this:

import numpy as np
from sklearn.mixture import GMM

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GMM(3, 'diag')  # 3 components, diagonal covariance matrices
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster:
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
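Roughly, what I'm missing is something like the sketch below (hypothetical, not working code; it assumes the ground-truth digit labels are available in an array `y_true`, which the snippet above does not define):

# Hypothetical sketch: y_true (the known class labels) is assumed to
# exist and to contain integers 0..n_classes-1.
n_clusters = 3
n_classes = len(np.unique(y_true))
confusion = np.zeros((n_clusters, n_classes), dtype=int)
for cluster in range(n_clusters):
    # count how often each true class appears in this cluster
    in_cluster = y_true[cluster_labels == cluster]
    confusion[cluster] = np.bincount(in_cluster, minlength=n_classes)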

Kuka
  • That paper is pretty opaque. [This answer](http://stats.stackexchange.com/a/154379/89612) on Cross Validated simplifies the procedure a bit. – kdbanman Dec 02 '15 at 16:23
  • Please post the code you have so far, and tell us about the data structures involved. – kdbanman Dec 02 '15 at 16:29
  • At the moment, my code is: `from sklearn.mixture import GMM clusterer = GMM(5, 'diag') clusterer.fit(X) cluster_labels = clusterer.predict(X)` I see that in order to compute the purity I need the confusion matrix. Now, my problem is that I can't loop through each cluster and count how many objects were classified as each class – Kuka Dec 02 '15 at 16:41
  • Alright. And what is `X`? Is it a numpy array? If so, what are its dimensions and what data does it contain? (*Notice how I edited that code into the body of your question. Please do that from now on when you have something additional to share*) :) – kdbanman Dec 02 '15 at 18:16
  • Yes, it's a NumPy array (1000L, 2L). The data are extracted from the MNIST dataset (200 examples for 5 classes) and read as floats. Then I applied PCA to reduce the dimensionality, and now my task is to cluster X using a GMM, varying the number of clusters, and to compute the purity for each choice of the number of clusters. – Kuka Dec 02 '15 at 19:25
  • You say "*my problem is that I can't loop through each cluster and count...*", but it's difficult to help with just that information. Please show us the problematic code and describe the problem by **editing it into your question.** – kdbanman Dec 02 '15 at 22:04

4 Answers


David's answer works, but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix) 

Also, if you need to compute inverse purity, all you need to do is replace `axis=0` with `axis=1`.
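For example, a quick check with hand-made labels (the values here are purely illustrative):

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 2, 2])

print(purity_score(y_true, y_pred))  # 0.8333...: clusters 0 and 2 are pure, cluster 1 is split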

Ugurite

sklearn doesn't implement a cluster purity metric. You have two options:

  1. Implement the measurement using sklearn data structures yourself. This and this have some Python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other. A minimal sketch of the general shape follows this list.

  2. Use the (much less mature) PML library, which does implement cluster purity.
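For option 1, here is one rough illustration (my own sketch, not code from either linked source; it assumes integer-encoded ground-truth labels):

import numpy as np
from sklearn.metrics import confusion_matrix

def purity(y_true, y_pred):
    # Rows are true classes, columns are predicted clusters (built over
    # the union of label values, so extra all-zero rows or columns do
    # not affect the result).
    cm = confusion_matrix(y_true, y_pred)
    # Credit each cluster only for its most common true class, then
    # divide by the total number of samples.
    return cm.max(axis=0).sum() / cm.sum()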

kdbanman

A very late contribution.

You can try to implement it like this, pretty much as in this gist:

import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # Work on a copy so the caller's y_true is not modified in place
    y_true = y_true.copy()
    # Matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true==labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Build one bin edge per class plus one extra, so each class value
    # falls into its own half-open bin [bin_i, bin_i+1); appending
    # max(labels)+1 closes the last bin
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
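
A quick sanity check with hand-made labels (illustrative only):

y_true = np.array([0, 0, 2, 2])  # note the gap in label values
y_pred = np.array([1, 1, 0, 0])

print(purity_score(y_true, y_pred))  # 1.0: each cluster maps cleanly to one class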
David
  • Hi @Hadij, it does not always give zeros, but indeed it has major flaws. I was notified (see [comments](https://gist.github.com/jhumigas/010473a456462106a3720ca953b2c4e2)) that it didn't work when the true labels were unordered and/or did not start at zero. I have updated the function; feedback is appreciated. – David May 19 '18 at 21:58

The currently top-voted answer correctly implements the purity metric, but it may not be the most appropriate metric in all cases, because it does not enforce a one-to-one mapping between predicted clusters and true labels: several clusters can be matched to the same true label.

For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g. one with two equal clusters of size 50) will achieve a purity of at least 0.99, rendering the metric useless.

Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:

import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)

    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)

    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
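
To make the contrast concrete, here is the imbalanced example from above, scored with both metrics (purity_score is the function from the top-voted answer; the numbers are illustrative):

y_true = np.array([0] * 99 + [1])       # 99 examples of one class, 1 of the other
y_pred = np.array([0] * 50 + [1] * 50)  # two arbitrary clusters of size 50

print(purity_score(y_true, y_pred))      # 0.99
print(cluster_accuracy(y_true, y_pred))  # 0.51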
Bai Li
  • I appreciate your concern. However, purity is not based on a one-to-one mapping because, generally, the numbers of clusters and classes are different. It is true that a trivial way to achieve a purity score of 1 is to put each data point in its own cluster. For more, please see [Introduction to Information Retrieval (book)](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html) or [how-to-calculate-purity](https://stats.stackexchange.com/questions/95731/how-to-calculate-purity) – Ugurite Jul 18 '19 at 20:04
  • Sorry -- on closer reading, your code does correctly implement the purity metric described in the link – Bai Li Jul 18 '19 at 21:40
  • `linear_sum_assignment` only works if the number of clusters is equal to the number of classes. otherwise, surplus classes/clusters are dropped. – moi Sep 21 '21 at 14:06