I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:
- Initialize the threshold distance allowed for clusters and the minimum cluster size
- Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
- Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
- Repeat with the reduced set of points until no more cluster can be formed having the minimum cluster size
I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.
I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points for some point. However, since I don't know the number of times I need to query (i.e. until the threshold is exceeded), I can't use k-nearest neighbor algorithms. I've been looking mostly at quadtrees and k-d trees.
Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).