
I am looking for an efficient algorithm for the following problem:

Given a set of points in 2D space, where each point is defined by its X and Y coordinates, I need to split this set of points into clusters so that if the distance between two arbitrary points is less than some threshold, these points must belong to the same cluster:

[image: sample clusters]

In other words, such a cluster is a set of points which are 'close enough' to each other.

The naive algorithm may look like this (a runnable sketch follows the list):

  1. Let R be a resulting list of clusters, initially empty
  2. Let P be a list of points, initially contains all points
  3. Pick random point from P and create a cluster C which contains only this point. Delete this point from P
  4. For every point Pi from P:
     4a. For every point Pc from C:
         4aa. If distance(Pi, Pc) < threshold, then add Pi to C and remove it from P
  5. If at least one point was added to cluster C during the step 4, go to step 4
  6. Add cluster C to list R. if P is not empty, go to step 3
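For concreteness, here is a direct Python transcription of the steps above (function and variable names are my own); it is the baseline to beat, not a recommendation:

```python
# Naive threshold clustering, mirroring steps 1-6 above.
import math
import random

def naive_clusters(points, threshold):
    P = list(points)                           # step 2: all points, unclustered
    R = []                                     # step 1: resulting list of clusters
    while P:                                   # step 6: repeat until P is empty
        C = [P.pop(random.randrange(len(P)))]  # step 3: seed a new cluster
        grew = True
        while grew:                            # step 5: repeat step 4 until stable
            grew = False
            for pi in P[:]:                    # step 4: try every remaining point
                if any(math.dist(pi, pc) < threshold for pc in C):
                    C.append(pi)               # step 4aa: close enough, join C
                    P.remove(pi)
                    grew = True
        R.append(C)
    return R
```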

However, this naive approach is very inefficient. I wonder if there is a better algorithm for this problem?

P.S. I don't know the number of clusters a priori.

ovk
  • Compute a matrix where the [x][y] place is the distance between point x and point y, then iterate over it so that if the distance is less than the threshold, mark the place with a 1, otherwise a 0. Now we have a graph on which we can use either DFS or BFS to find all the clusters. It may not be very efficient, but it reduces it to an easier problem (see the sketch after these comments) – spyr03 Sep 06 '15 at 22:14
  • 2
    Whatever method you choose, make sure you compare the distance squared `(x2 - x1)^2 + (y2 - y1)^2` with the threshold squared, to avoid having to calculate square roots. – m69's been on strike for years Sep 06 '15 at 23:32
  • And look into storing the points in a space-partitioning tree. – m69's been on strike for years Sep 06 '15 at 23:44
  • sounds like a job for [linear discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)? – vzn Sep 28 '15 at 15:13
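Taken together, the first two comments already give a workable O(n²) solution: threshold the pairwise squared distances into an adjacency matrix, then run BFS for connected components. A hedged Python sketch (all names here are mine):

```python
# spyr03's matrix idea plus m69's squared-distance tip, with BFS for components.
from collections import deque

def adjacency(points, threshold):
    t2 = threshold ** 2  # compare squared distances; no square roots needed
    return [[(px - qx) ** 2 + (py - qy) ** 2 < t2 for qx, qy in points]
            for px, py in points]

def bfs_clusters(points, threshold):
    adj = adjacency(points, threshold)
    seen, clusters = set(), []
    for start in range(len(points)):
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), []
        while queue:                      # BFS over one connected component
            i = queue.popleft()
            cluster.append(points[i])
            for j, close in enumerate(adj[i]):
                if close and j not in seen:
                    seen.add(j)
                    queue.append(j)
        clusters.append(cluster)
    return clusters
```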

2 Answers


There are some classic algorithms here:

  • Hierarchical Agglomerative Clustering
  • DBSCAN

that you should read and understand.
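If a library is acceptable, single-linkage agglomerative clustering cut at the distance threshold computes exactly the clusters asked for: any two points within the threshold always end up merged. A minimal sketch using SciPy, assuming the points fit in memory as an (n, 2) array:

```python
# Hedged sketch: single-linkage HAC cut at the distance threshold (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.5, 0.1], [5.0, 5.0]])
threshold = 1.0

Z = linkage(points, method='single')           # merge by minimum pairwise distance
labels = fcluster(Z, t=threshold, criterion='distance')
print(labels)  # e.g. [1, 1, 2]: the first two points cluster together
```

For the second bullet, scikit-learn's `DBSCAN(eps=threshold, min_samples=1)` yields the same connected components, since with `min_samples=1` every point is a core point and no point is labeled noise.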

Has QUIT--Anony-Mousse
  1. Split up the space of points into a grid. This grid would have unit length equal to threshold / sqrt(8).

  2. Iterate through the list of points P, adding each point to both the square it occupies and a new cluster. If a point is added to a square which already contains a point, add it to the cluster of the other point(s). I'll call the list of all occupied squares S.

  3. Now take any square from S and its cluster c. For each adjacent or diagonal square, combine the cluster of that square with c and remove the square from S. Repeat the process for all squares just added.

  4. Once no more adjacent squares can be found, the cluster is finished and can be added to C. Repeat step 3 with any remaining squares in S. When S is empty, you're finished.

crb233
  • This is not right. A grid can help find candidates for points within a certain distance threshold, but it cannot replace an explicit check: any grid size will either be small enough that two points can be closer than the threshold without being in adjacent/diagonal squares, or big enough that two points can be in adjacent/diagonal squares without being closer than the threshold, or both. (Yours has the former problem.) – ruakh Sep 07 '15 at 01:33
  • Also, it's not obvious to me that this is faster than the OP's strategy, since the complexity of "combining" two clusters is non-obvious. BFS or DFS over occupied squares probably makes more sense. – ruakh Sep 07 '15 at 01:33
  • @ruakh It's clearly not optimal, but I think it would be faster than the OP's solution since it doesn't require checking every point against all non-clustered points. I did overlook the first problem however. Thanks for the comments – crb233 Sep 07 '15 at 02:17