
I would like to measure the quality of clustering using Quantization Error but can't find any clear info regarding how to compute this metric.

The few documents/articles I've found are:

Regarding the third link (which is the best piece of info I've found so far), I don't know how to interpret the calculation (see snippet below):

(the # annotations are mine. question marks indicate steps that are unclear to me)

def quantization_error(self):
    """
    This method calculates the quantization error of the given clustering
    :return: the quantization error
    """
    total_distance = 0.0
    s = Similarity(self.e)  # Class containing different types of distance measures

    # For each point, compute squared fractional distance between point and centroid ?
    for i in range(len(self.solution.patterns)):
        total_distance += math.pow(s.fractional_distance(self.solution.patterns[i], self.solution.centroids[self.solution.solution[i]]), 2.0)

    return total_distance / len(self.solution.patterns)  # Divide total_distance by the total number of points ?

QUESTION: Is this calculation of the quantization error correct? If not, what are the steps to compute it?

Any help would be much appreciated.

solub
  • The formulas and steps are documented quite well in many places online. "How to implement in Python" suggests that you need a programming tutorial, rather than Stack Overflow. – Prune Jan 10 '18 at 01:07
  • @Prune I beg to differ with you. There is actually very little information regarding quantization error _when it comes to clustering._ If you have a specific online document or site in mind regarding this subject, I'd love to have a look. Also I don't need a programming tutorial. – solub Jan 10 '18 at 01:53
  • for each point: error += [norm](https://stackoverflow.com/a/32142023/86967)( original - updated ) – Brent Bradburn Jan 10 '18 at 04:06
  • @nobar Thanks for your comment. Could you explain what "original" and "updated" stand for when it comes to clustering? Also, I have edited my question and the formula you're suggesting seems to differ from the one I found on another site. – solub Jan 10 '18 at 13:31
  • @solub I see where our differences lie, and your question update makes the actual problem quite clear; it's now a much better question. I withdrew my closure vote and reversed my down vote. – Prune Jan 10 '18 at 15:35
  • 1
    @nobar: I don't think your generalization is the help that OP needs -- your comment appears to be merely a restatement of the generic error concept. – Prune Jan 10 '18 at 15:37

1 Answer


At the risk of restating things you already know, I'll cover the basics.

REVIEW

Quantization is any time we simplify a data set by moving each of the many data points to a convenient (nearest, by some metric) quantum point. These quantum points are a much smaller set. For instance, given a set of floats, rounding each one to the nearest integer is a type of quantization.
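To make the rounding example concrete, here is a minimal sketch (the data values are mine, purely for illustration):

```python
import numpy as np

# Quantization example: each float is moved to its nearest integer.
# The integers are the (much smaller) set of quantum points.
data = np.array([0.2, 1.7, 2.4, 3.9])
quantized = np.round(data)
print(quantized)  # [0. 2. 2. 4.]
```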

Clustering is a well-known, often-used type of quantization, one in which we use the data points themselves to determine the quantum points.

Quantization error is a metric of the error introduced by moving each point from its original position to its associated quantum point. In clustering, we often measure this error as the root-mean-square error of each point (moved to the centroid of its cluster).
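That root-mean-square formulation can be sketched as follows (function and variable names are mine, not from the OP's code; distances are Euclidean here for simplicity):

```python
import numpy as np

def rms_quantization_error(points, centroids, labels):
    """RMS quantization error: root of the mean squared distance from
    each point to the centroid of its assigned cluster."""
    diffs = points - centroids[labels]       # displacement of each point
    sq_dists = np.sum(diffs ** 2, axis=1)    # squared Euclidean distances
    return np.sqrt(np.mean(sq_dists))        # root of the mean

# Tiny illustrative data set: two points near one centroid, one on another.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.5, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
print(rms_quantization_error(points, centroids, labels))
```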

YOUR SOLUTION

... is correct, in a very common sense: it computes the sum-squared error over the data set, then takes the mean. This is a perfectly valid metric.

The method I see more often is to take the square root of that final mean, cluster by cluster, and use the sum of those roots as the error function for the entire data set.
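That per-cluster variant might be sketched like this (again, names and data are mine, for illustration only):

```python
import numpy as np

def summed_cluster_rms(points, centroids, labels):
    """Take the RMS error cluster by cluster, then sum those roots
    to get an error value for the whole data set."""
    total = 0.0
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) == 0:
            continue  # skip empty clusters
        sq_dists = np.sum((members - centroids[k]) ** 2, axis=1)
        total += np.sqrt(np.mean(sq_dists))
    return total

points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.5, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
print(summed_cluster_rms(points, centroids, labels))
```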

THE CITED PAPER

One common question in k-means clustering (or any clustering, for that matter) is "what is the optimum number of clusters for this data set?" The paper uses another level of quantization to look for a balance.

Given a set of N data points, we want to find the optimal number 'm' of clusters, which will satisfy some rationalization for "optimum clustering". Once we find m, we can proceed with our usual clustering algorithm to find the optimal clustering.

We can't simply minimize the error at all costs: using N clusters gives us an error of 0.
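You can see this degenerate case directly: with N clusters, every point is its own centroid, so the error collapses to zero (a quick sketch, with made-up data):

```python
import numpy as np

# Degenerate case: one cluster per point. Each point *is* its own
# centroid, so the mean squared quantization error is exactly 0.
points = np.random.default_rng(1).normal(size=(5, 2))
centroids = points.copy()          # N clusters: each point a centroid
labels = np.arange(len(points))    # each point assigned to itself
error = np.mean(np.sum((points - centroids[labels]) ** 2, axis=1))
print(error)  # 0.0
```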

Is that enough explanation for your needs?

Prune
  • First of all, I'd like to thank you for the clear and comprehensive explanation. I realise now that "Quantization Error" is nothing but another word to describe "variance" (along with "distortion", "within-cluster dissimilarities", or "inertia"). The term was so unfamiliar to me that I thought it was referring to a method very different from the usual ANOVA-based approaches. How ironic. – solub Jan 10 '18 at 22:29