How to pick the T1 and T2 threshold values for Canopy Clustering?

Question

I am trying to implement the Canopy clustering algorithm along with K-Means. I've done some searching online that says to use Canopy clustering to get your initial starting points to feed into K-means, the problem is, in Canopy clustering, you need to specify 2 threshold values for the canopy: T1 and T2, where points in the inner threshold are strongly tied to that canopy and the points in the wider threshold are less tied to that canopy. How are these threshold, or distances from the canopy center, determined?

Problem context:

The problem I'm trying to solve is, I have a set of numbers such as [1,30] or [1,250] with set sizes of about 50. There can be duplicate elements and they can be floating point numbers as well, such as 8, 17.5, 17.5, 23, 66, ... I want to find the optimal clusters, or subsets of the set of numbers.

So, if Canopy clustering with K-means is a good choice, then my questions still stands: how do you find the T1, T2 values?. If this is not a good choice, is there a better, simpler but effective algorithm to use?

Here is another similar question http://stats.stackexchange.com/questions/13895/how-do-i-algorithmically-determine-values-of-t1-t2-for-canopy-clustering — cyraxjoe, Oct 26 '11 at 21:46
Have you had any luck with this yet? I'm looking to use Canopy Clustering to find an initial cluster set to feed to K-Means. I might just use the "Jump Method" as described [here](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set) (which sounds similar to the method @rpd describes in his answer), but if you've found a good way to determine T1 and T2 I'd like to use Canopy Clustering instead. — JesseBuesking, Jul 27 '12 at 22:50

score 2 · Answer 1 · answered Nov 09 '11 at 07:43

Perhaps naively, I see the problem in terms of a sort of spectral-estimation. Suppose I have 10 vectors. I can compute the distances between all pairs. In this case I'd get 45 such distances. Plot them as a histogram in various distance ranges. E.g. 10 distances are between 0.1 and 0.2, 5 between 0.2 and 0.3 etc. and you get an idea of how the distances between vectors are distributed. From this information you can choose T1 and T2 (e.g. choose them so that you cover the distance range that is the most populated).

Of course, this is not practical for a large dataset - but you could just take a random sample or something so that you at least know the ballpark of T1 and T2. Using something like Hadoop you could do some sort of prior spectral estimation on a large number of points. If all incoming data you are trying to cluster is distributed in much the same way then you cjust need to get T1 and T2 once, then fix them as constants for all future runs.

score 2 · Answer 2 · answered Jan 15 '12 at 12:12

Actually that is the big issue with Canopy Clustering. Choosing the thresholds is pretty much as difficult as the actual algorithm. In particular in high dimensions. For a 2D geographic data set, a domain expert can probably define the distance thresholds easily. But in high-dimensional data, probably the best you can do is to run k-means on a sample of your data first, then choose the distances based on this sample run.

How to pick the T1 and T2 threshold values for Canopy Clustering?

2 Answers2

Linked