Efficient data structure for quality threshold clustering algorithm

Question

I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:

Initialize the threshold distance allowed for clusters and the minimum cluster size
Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
Repeat with the reduced set of points until no more cluster can be formed having the minimum cluster size

I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.

I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points for some point. However, since I don't know the number of times I need to query (i.e. until the threshold is exceeded), I can't use k-nearest neighbor algorithms. I've been looking mostly at quadtrees and k-d trees.

Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).

While the feedback I've gotten to questions so far has been very useful, none of them have actually answered any of those questions, so I thought it best not to accept any. — NordCoder, Dec 17 '12 at 08:29

score 2 · Accepted Answer · answered Dec 17 '12 at 07:37

This algorithm sounds like a predecessor of DBSCAN (Wikipedia), which is known to work very well with R*-Tree indexes (Wikipedia). But of course, kd-trees are also an option. The main difference between these two is that R*-trees are meant for database use - they support online insertions and deletions very well, and are block oriented - while kd-trees are more of an in-memory data structure based on binary splits. R*-trees perform rebalancing, while kd-trees will slowly become unbalanced and will need to be rebuilt. I find nearest neighbor search in R*-trees much more understandable than in k-d-trees, because you have the bounding rectangles are very intuitive.

DBSCAN also "removes" points from further consideration, but simply by marking them as already assigned. That way you don't need to update the index; and it's sufficient to bulk-load it once in the beginning. You should be able to do this for QT, too. So unless I'm mistaken, you can get the QT clustering efficiently by running DBSCAN with epsilon set to the QT clustering and minPts=2 (although one would prefer higher values in proper DBSCAN).

There are a number of DBSCAN implementations around. The one in Weka is exceptionally crappy, so stay away from it. The fpc implementation in R is okay, but could still be a lot faster. ELKI seems to be the only one with full index support, and the speed difference is massive. Their Benchmark shows a 12x speed gain by using an index on this data set, allowing them to cluster in 50 seconds instead of 603 (without index). Weka took incredible 37917 seconds, R fpc 4339 there. That aligns with my experiences, Weka has the reputation of being quite slow, and R only kicks ass at vectorized operations, once the R interpreter has to work, it is significantly slower than anything native. But it is a good example about how different the same algorithm can perform when it is implemented by different people. I would have expected this to be 2x-5x, but apparently the differences can easily be 50x from one programmer implementing the same algorithm to another.

Thanks for the input. I looked at the wiki links you supplied, and with a few modifications, I think I might be able to use them. I hadn't come across either before (though I did look at R-trees), so I'll continue to investigate :) I was especially worried about the balancing issue with k-d trees, but simply marking the nodes as already visited is a very simple and nice idea. Thanks for the implementations as well, though I plan to implement it myself, having something to lean up against while doing so is always nice. — NordCoder, Dec 17 '12 at 08:40

Efficient data structure for quality threshold clustering algorithm

1 Answers1