
I have many points (latitudes and longitudes) on a plane (a city) and I want to find two clusters. Cluster 1 is points cluttered close together and Cluster 2 is everything else.

I know the problem definition is not exact. The only thing defined is that I need exactly 2 clusters; out of N points, how many end up in cluster 1 versus cluster 2 is not specified.

The main aim is to identify points which are very close to each other and separate them from the rest (which are more evenly spread out).

The best I can think of is the following algorithm:

1. For each point, calculate the sum of squared distances to all other points.
2. Run k-means with k=2 on these scores.

The squaring (or perhaps an even higher power) of the distance should help by amplifying the separation between the scores. However, this algorithm will be biased towards points near the center of the city and will struggle to find clusters at the edges of the city.
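A minimal sketch of these two steps in plain Python (the function names are mine, and a real implementation would likely use numpy or scikit-learn instead of this hand-rolled 1-D k-means):

```python
import math

def sum_sq_dists(points):
    """For each point, the sum of squared distances to all other points."""
    return [sum(math.dist(p, q) ** 2 for q in points) for p in points]

def kmeans_1d_k2(values, iters=100):
    """Plain k-means with k=2 on scalar scores; returns a 0/1 label per point."""
    c0, c1 = min(values), max(values)  # initialise centers at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        # assignment step: nearest center wins
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        g1 = [v for v, l in zip(values, labels) if l == 1]
        if not g0 or not g1:   # degenerate case: all scores identical
            break
        n0, n1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if n0 == c0 and n1 == c1:
            break              # converged
        c0, c1 = n0, n1
    return labels
```

Points in a tight blob get nearly identical (and, when the blob sits near the city center, small) scores, so they land in one cluster together — which also shows the center bias mentioned above.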

Any suggestions on how to avoid this problem, or any other suggestions to improve this algorithm?

arahant
    maybe this one helps you: http://stackoverflow.com/questions/17112719/how-can-i-find-the-center-of-a-cluster-of-data-points – Regenschein Jul 10 '13 at 12:46
  • Does the cluster 1 need to be a connected space ? In sens of streets (without points of cluster 2), or alignement of points(without points of cluster 2) of something like that ? – Galigator Jul 10 '13 at 12:47
  • @fordprefect Wow, I was hoping it would be simpler than the description provided in the answer. Since I am no mathematician, it will take me some time to grasp what is being said. – arahant Jul 10 '13 at 13:36
  • @MrK - I am not sure I understand, these are just points and there is no concept of edges between these points. However if you wish, you can consider the entire graph as completely connected. – arahant Jul 10 '13 at 13:38

2 Answers


I'd suggest something along the following lines:

Key concept

Count the number of neighbouring points at a distance less than a given value.

Semiformal description

  1. For each point P, count the number nc(P) of neighbouring points at a distance less than a given value d_cutoff.
  2. Cluster all points P_i with nc(P_i) greater than a threshold thres_count into cluster #1.
  3. For each P_i in cluster #1, add its close neighbours, i.e. points Q with d(Q, P_i) < d_cutoff, to the very same cluster #1.
  4. Set cluster #2 to the complement of cluster #1.

Algorithmic angle

  1. Build an undirected graph G=(V, E) with your points as the vertex set V and an edge between every pair of points at a distance less than d_cutoff from each other.
  2. Delete all edges e=(v,w) from the graph where deg(v) < thres_count and deg(w) < thres_count.
  3. G's isolated vertices form cluster #2; the complement is cluster #1.
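The three graph steps above can be sketched like this (a minimal stdlib-Python sketch; the function name and the explicit edge list are illustrative — for large point sets you'd want a spatial index rather than the O(n²) pair scan):

```python
import math

def two_clusters(points, d_cutoff, thres_count):
    """Graph-based split of a list of (x, y) points into two clusters."""
    n = len(points)
    # 1. edges between every pair of points closer than d_cutoff
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if math.dist(points[i], points[j]) < d_cutoff]
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    # 2. drop edges where *both* endpoints have low degree
    kept = [(i, j) for i, j in edges
            if deg[i] >= thres_count or deg[j] >= thres_count]
    # 3. isolated vertices form cluster #2, the rest cluster #1
    touched = {v for e in kept for v in e}
    cluster1 = sorted(touched)
    cluster2 = sorted(set(range(n)) - touched)
    return cluster1, cluster2
```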

Heuristic on how to choose d_cutoff

Build a minimum spanning tree (MST) of your point set; the frequency distribution of its edge lengths should hint at suitable cutoff values. Short pairwise distances are incorporated into the MST first, so for point sets with a natural clustering there should be at least one pronounced gap in the ordered sequence of edge lengths. Concretely:

  1. Partition the set of MST edge lengths into a small number of adjacent intervals, ordered in the natural way.
  2. Count how many actual distance values fall into each interval.
  3. Consider the map between an interval's ordinal number and its count of distance values: a large delta between function values for successive arguments suggests taking the upper bound of distances in the lower interval as d_cutoff.
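A minimal sketch of this heuristic, using Prim's algorithm over the complete graph and simply taking the largest gap in the sorted MST edge lengths as the interval boundary (an assumption on my part — the interval-counting scheme described above is more general):

```python
import math

def mst_gap_cutoff(points):
    """Return a candidate d_cutoff: the upper bound of the short-edge
    group at the largest gap in the sorted MST edge lengths."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # best[v] = shortest edge linking v to the tree
    best[0] = 0.0
    edge_lens = []
    for _ in range(n):      # Prim's algorithm, O(n^2)
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if best[u] > 0:
            edge_lens.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    edge_lens.sort()
    # largest gap between consecutive sorted edge lengths
    _, k = max((edge_lens[i + 1] - edge_lens[i], i)
               for i in range(len(edge_lens) - 1))
    return edge_lens[k]
```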

collapsar
  • Thanks for the answer. The algorithm will definitely get rid of the problem of my algorithm being biased towards the points in the center. The algorithm looks like a good approximation. Just that the trouble would be choosing d_cutoff and thres_count. Specifically choosing d_cutoff will be tough. Too low and we might not get enough points in the cluster. Too high and we might chain in too many points in step 3 (of your semiformal description) – arahant Jul 10 '13 at 13:51
  • @arahant: i've added a section to the answer addressing the issue. – collapsar Jul 10 '13 at 15:37

Since points in cluster 1 are close to each other, I think a density-based clustering algorithm may help. You may try the OPTICS algorithm, which is similar to DBSCAN but is aware of varying density and the number of clusters can be specified by the user.
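To illustrate the density-based idea, here is a minimal DBSCAN-style core-point pass (not full OPTICS, which also orders points by reachability): points with at least min_pts neighbours within eps seed cluster 1, everything unreachable stays in cluster 2. The eps / min_pts values below are illustrative; in practice you would use a library implementation such as scikit-learn's.

```python
import math
from collections import deque

def density_split(points, eps, min_pts):
    """DBSCAN-style pass: dense regions -> cluster 1, the rest -> cluster 2."""
    n = len(points)
    def neighbours(i):
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= eps]
    labels = [2] * n            # default: cluster 2 ("everything else")
    visited = [False] * n
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        nb = neighbours(i)
        if len(nb) < min_pts:
            continue            # not a core point
        labels[i] = 1           # expand the dense region around core point i
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            labels[j] = 1
            if not visited[j]:
                visited[j] = True
                nbj = neighbours(j)
                if len(nbj) >= min_pts:
                    queue.extend(nbj)   # j is also core: keep expanding
    return labels
```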

fuch