8

I have a number of points on a relatively small 2-dimensional grid, which wraps around in both dimensions. The coordinates can only be integers. I need to divide them into sets of at most N points that are close together, where N will be quite a small cut-off, I suspect 10 at most.

I'm designing an AI for a game, and I'm 99% certain using minimax on all the game pieces will give me a usable lookahead of about 1 move, if that. However, distant game pieces should be unable to affect each other until we're looking ahead by a large number of moves, so I want to partition the game into a number of sub-games of N pieces at a time. To do that, I need to ensure I select a reasonable N pieces at a time, i.e. ones that are close together.

I don't care whether outliers are left on their own or lumped in with their least-distant cluster. Breaking up natural clusters larger than N is inevitable, and only needs to be sort-of reasonable. Because this is used in a game AI with limited response time, I'm looking for as fast an algorithm as possible, and willing to trade off accuracy for performance.

Does anyone have any suggestions for algorithms to look at adapting? K-means and its relatives don't seem appropriate, as I don't know how many clusters I want to find, but I do have a bound on how large I want the clusters to be. I've seen some evidence that approximating a solution by snapping points to a grid can help some clustering algorithms, so I'm hoping the integer coordinates make the problem easier. Hierarchical distance-based clustering will be easy to adapt to the wrap-around coordinates, as I just plug in a different distance function, and it's also relatively easy to cap the size of the clusters. Are there any other ideas I should be looking at? (A sketch of the distance function I mean is below.)
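For concreteness, the wrap-around distance I'd plug in would look something like this rough sketch (Python; `width` and `height` are the grid dimensions, and the names are mine):

```python
def torus_distance(a, b, width, height):
    """Manhattan distance on a grid that wraps around in both dimensions."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Going "the other way around" the torus may be shorter.
    return min(dx, width - dx) + min(dy, height - dy)
```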

I'm more interested in algorithms than libraries, though libraries with good documentation of how they work would be welcome.

EDIT: I originally asked this question when I was working on an entry for the Fall 2011 AI Challenge, which I sadly never finished. The page I linked to has a reasonably short, reasonably high-level description of the game.

The two key points are:

  1. Each player has a potentially large number of ants
  2. Every ant is given orders every turn, moving 1 square either north, south, east or west; this means the branching factor of the game is O(4^ants).

In the contest there were also strict time constraints on each bot's turn. I had thought to approach the game using minimax (the turns are really simultaneous, but as a heuristic I thought it would be okay), but I feared there wouldn't be time to look ahead very many moves if I considered the whole game at once. But as each ant moves only one square each turn, two ants that are N squares apart by the shortest route cannot possibly interfere with one another until we're looking ahead N/2 moves. A sketch of that independence test is below.
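In code, the test would be something like this (my own sketch and names, reusing the wrap-around metric from above):

```python
def may_interact(a, b, width, height, lookahead):
    """Two ants more than 2*lookahead squares apart (shortest path on the
    torus) cannot meet within lookahead turns: each moves 1 square per turn."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dist = min(dx, width - dx) + min(dy, height - dy)
    return dist <= 2 * lookahead
```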

So the solution I was searching for was a good way to pick smaller groups of ants at a time and minimax each group separately. I had hoped this would allow me to search deeper into the move-tree without losing much accuracy. But obviously there's no point using a very expensive clustering algorithm as a time-saving heuristic!

I'm still interested in the answer to this question, though more in what I can learn from the techniques than for this particular contest, since it's over! Thanks for all the answers so far.

Ben
  • How big is the grid? Does 10 "close" points mean all adjacent, or with gaps, e.g. groups on a Go board? – denis Nov 22 '11 at 18:16
  • The grid varies. I believe it's up to some hundreds. By close I mean ideally closer to each other than to pieces not in the partition. If there's only 10 pieces on the board a single partition can cover the whole board. But if there's a clump of 20, it has to be split into two groups of 10; there shouldn't be a group that includes some of the clump and some more distant pieces. – Ben Nov 22 '11 at 20:43
  • I think you should post more details about the rules of the game and how the pieces interact with each other. Your assumption about needing to create these groups in the first place may be incorrect. – Fantius Jan 08 '12 at 21:58
  • How many ants can there be? Doing `O(n^2)` setting up doesn't sound too bad, if it reduces the second step from `O(4^n)` to `O(4^sqrt(n))` or so. – Thomas Ahle Jan 13 '12 at 12:11
  • @ThomasAhle The number of ants is technically unbounded. The home page for the contest http://ants.aichallenge.org/ has a flash visualisation of a random game; the one I just watched was approaching 300 ants. – Ben Jan 13 '12 at 12:30
  • Isn't the map size bounded too? 300^2 is not a lot of work to do in half a second. – Thomas Ahle Jan 13 '12 at 13:58
  • Yeah, the map size is bounded in practice. 4^300^lookahead is a lot of work though. O(n^2) preprocessing would be very good, if it allowed me to turn that into 30*4^10^lookahead. – Ben Jan 13 '12 at 14:06
  • @Ben, any thoughts regarding choosing an answer for the bounty? – cyborg Jan 13 '12 at 17:21

5 Answers

6

The median-cut algorithm is very simple to implement in 2D and would work well here. Your outliers would end up as groups of 1 which you could discard or whatever.

Further explanation requested: median cut is a quantization algorithm, but all quantization algorithms are special-case clustering algorithms. In this case the algorithm is extremely simple: find the smallest bounding box containing all points, split the box along its longest side (and shrink it to fit the points), and repeat until the target number of boxes is reached. A sketch adapted to this question follows.
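A minimal Python sketch of the idea, adapted (my adaptation, not part of the original algorithm) to recurse until every box holds at most N points, and ignoring the wrap-around:

```python
def median_cut(points, max_size):
    """Recursively split points at the median of the longest side of
    their bounding box until every group holds <= max_size points."""
    if len(points) <= max_size:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Split along the longer side of the (shrunk-to-fit) bounding box.
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return median_cut(pts[:mid], max_size) + median_cut(pts[mid:], max_size)
```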

A more detailed description and coded example

Wiki on color quantization has some good visuals and links

gordy
  • Awarded bounty for simplicity and for pointing to an existing algorithm. The requirement for "sets of at most N points" is not explicitly met, but it could be adjusted for. – cyborg Jan 14 '12 at 16:58
  • Care to explain the algorithm here or at least provide some reference to it? – Ivo Flipse Jun 28 '13 at 08:05
  • @IvoFlipse added brief explanation and some links – gordy Jun 29 '13 at 04:28
4

Since you are writing a game where (I assume) only a constant number of pieces move between each clustering, you can take advantage of an online algorithm to get constant update times.

The property of not locking yourself to a fixed number of clusters is called nonstationarity, I believe.

This paper seems to have a good algorithm with both of the above two properties: Improving the Robustness of 'Online Agglomerative Clustering Method' Based on Kernel-Induce Distance Measures (you might be able to find it elsewhere as well).
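To give the flavour, here is my own toy sketch of the general online agglomerative idea (not the kernel-based method from the paper): absorb each arriving point as its own cluster, then merge the two closest centroids whenever the cluster budget is exceeded.

```python
def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def online_step(centroids, counts, point, max_clusters):
    """Absorb one arriving point, merging the closest pair of
    centroids whenever the cluster budget is exceeded."""
    centroids.append([point[0], point[1]])
    counts.append(1)
    if len(centroids) > max_clusters:
        # Find the closest pair and merge them into their weighted mean.
        i, j = min(((i, j) for i in range(len(centroids))
                    for j in range(i + 1, len(centroids))),
                   key=lambda ij: dist2(centroids[ij[0]], centroids[ij[1]]))
        n = counts[i] + counts[j]
        centroids[i] = [(counts[i] * centroids[i][k] +
                         counts[j] * centroids[j][k]) / n for k in (0, 1)]
        counts[i] = n
        del centroids[j], counts[j]
```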

There is also a nice video showing the algorithm at work.

Thomas Ahle
  • It seems like it doesn't meet the requirement for a cap on the number of points per cluster. – cyborg Jan 08 '12 at 15:06
  • No, probably not a hard cap, but if you tweak the parameters you should be able to get below any N, as long as more than N points are not literally on top of each other. – Thomas Ahle Jan 09 '12 at 00:00
1

Construct a graph G=(V, E) over your grid, and partition it. Since you are interested in algorithms rather than libraries, here is a recent paper:

Daniel Delling, Andrew V. Goldberg, Ilya Razenshteyn, and Renato F. Werneck. Graph Partitioning with Natural Cuts. In 25th International Parallel and Distributed Processing Symposium (IPDPS’11). IEEE Computer Society, 2011. [PDF]

From the text:

The goal of the graph partitioning problem is to find a minimum-cost partition P such that the size of each cell is bounded by U.

So you will set U=10.
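As a much cruder stand-in for the paper's natural-cuts machinery, the idea of size-bounded cells can be illustrated with a greedy sketch like this (entirely my own toy, with a hypothetical `adjacent` predicate standing in for the graph's edges):

```python
from collections import deque

def greedy_partition(pieces, U, adjacent):
    """Grow each cell by BFS over adjacent pieces until it reaches U members."""
    unassigned = set(pieces)
    cells = []
    while unassigned:
        seed = next(iter(unassigned))
        cell, queue = [], deque([seed])
        while queue and len(cell) < U:
            p = queue.popleft()
            if p in unassigned:
                unassigned.remove(p)
                cell.append(p)
                queue.extend(q for q in unassigned if adjacent(p, q))
        cells.append(cell)
    return cells
```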

cyborg
1

You can calculate a minimum spanning tree and remove the longest edges. Then you can calculate the k-means. Remove another long edge and calculate the k-means again. Rinse and repeat until you have N=10. I believe this algorithm is called single-link k-clustering, and the clusters are similar to Voronoi diagrams:

"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges."

See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering
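A self-contained sketch of the MST part (my own Python, with plain Euclidean distances and O(n^2) edges, which should be fine for a few hundred pieces): running Kruskal's algorithm and stopping once k components remain is equivalent to building the MST and deleting the k-1 longest edges.

```python
from itertools import combinations

def single_link_clusters(points, k):
    """Kruskal's algorithm, stopped early: merging components in order of
    increasing edge length until k remain yields the single-link clusters."""
    parent = list(range(len(points)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: (points[e[0]][0] - points[e[1]][0]) ** 2 +
                                 (points[e[0]][1] - points[e[1]][1]) ** 2)
    components = len(points)
    for i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return list(clusters.values())
```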

Micromega
1

Consider the case where you only want two clusters. If you run k-means, you will get two centres, and the boundary between the two clusters is perpendicular to the line joining them. You can find out which cluster a point is in by projecting it onto that line and then comparing its position on the line with a threshold (e.g. take the dot product of the direction of the line with the vector from either cluster centre to the point).

For two clusters, this means you can adjust the sizes of the clusters by moving the threshold. You can sort the points by their distance along the line connecting the two cluster centres and then move the threshold along the line quite easily, trading off the inequality of the split against how neat the clusters are.

You probably don't have k=2, but you can run this hierarchically, by dividing into two clusters and then sub-dividing the clusters. A sketch of the two-way split is below.
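Something like this (my own minimal Python sketch; `A` and `B` are the two k-means centres already computed, and `size_of_first` is the split being tried):

```python
def split_by_threshold(points, A, B, size_of_first):
    """Sort points by their projection onto the A->B direction and cut so
    that exactly size_of_first points land in the first cluster."""
    d = (B[0] - A[0], B[1] - A[1])  # direction between the centres
    pts = sorted(points, key=lambda p: p[0] * d[0] + p[1] * d[1])  # (B-A).X
    return pts[:size_of_first], pts[size_of_first:]
```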

(After comment)

I'm not good with pictures, but here is some relevant algebra.

With k-means we divide points according to their distance from cluster centres, so for a point Xi and two centres Ai and Bi we might be interested in

SUM_i (Xi - Ai)^2 - SUM_i(Xi - Bi)^2

This is SUM_i Ai^2 - SUM_i Bi^2 + 2 SUM_i (Bi - Ai)Xi

So a point gets assigned to one cluster or the other depending on the sign of K + 2(B - A).X - a constant plus the dot product between the vector to the point and the vector joining the two cluster centres. In two dimensions, the dividing line between the points of the plane that end up in one cluster and the points that end up in the other is a line perpendicular to the line between the two cluster centres. What I am suggesting is that, in order to control the number of points after your division, you compute (B - A).X for each point X and then choose a threshold that divides all points in one cluster from all points in the other. This amounts to sliding the dividing line up or down the line between the two cluster centres, while keeping it perpendicular to that line.

Once you have the dot products Yi, where Yi = SUM_j (Bj - Aj) Xij, a measure of how closely grouped a cluster is is SUM_i (Yi - Ym)^2, where Ym is the mean of the Yi in the cluster. I am suggesting that you use the sum of these values for the two clusters to tell how good a split you have. To move a point into or out of a cluster and get the new sum of squares without recomputing everything from scratch, note that SUM_i (Si + T)^2 = SUM_i Si^2 + 2T SUM_i Si + n T^2 for n terms, so if you keep track of counts, sums and sums of squares, you can work out what happens to a sum of squares when you add or subtract a value from every component, as the mean of the cluster changes when you add or remove a point. A sketch of this bookkeeping is below.
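The running-sums bookkeeping could look like this (a sketch with my own names; spread() uses the equivalent identity SUM (Yi - Ym)^2 = SUM Yi^2 - (SUM Yi)^2 / n):

```python
class ClusterStats:
    """Count, sum and sum of squares of the projected values Y in one
    cluster, so the spread updates in O(1) as points move between clusters."""
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0

    def add(self, y):
        self.n += 1; self.s += y; self.ss += y * y

    def remove(self, y):
        self.n -= 1; self.s -= y; self.ss -= y * y

    def spread(self):  # SUM (Yi - Ym)^2
        return 0.0 if self.n == 0 else self.ss - self.s * self.s / self.n
```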

mcdowella
  • @mcdowella: How do you want to sub-divide the clusters? Into halves? – Micromega Jan 12 '12 at 10:03
  • At each stage, I would look at how good the possible splits were, and go for the best sensible split. If N=10 and I had 100 points, I might check 10/90, 20/80, .. 80/20, 90/10 and perhaps a few close to these like 11/89. One measure of how good a split was might be the sum of the squared distances along the cluster centre line from each point to the centre of its split. I don't have any theory to back this up, so you might be best to try a number of possibilities on real data and see if any do particularly well or badly. – mcdowella Jan 12 '12 at 18:51
  • @mcdowella: It looks like a treemap algorithm. Maybe an r-tree? – Micromega Jan 12 '12 at 18:59
  • I don't think comparisons with r-trees are particularly useful, because r-trees are designed for things like nearest-neighbour searches, which are not required here. It reminds me a bit of some of the http://en.wikipedia.org/wiki/Decision_tree_learning algorithms, but they tend not to be built on k-means. Divisive clustering with k-means or other re-splitting algorithms is a known method in its own right, although not a particularly popular one. – mcdowella Jan 13 '12 at 05:16
  • Thank you very much for your fast reply. I didn't mean an r-tree but a kd-tree. Is this measure a linear thing? – Micromega Jan 14 '12 at 07:54
  • If you mean the measure I suggested to choose between possible clusterings, AFAIK it is a non-linear function, since it is a sum of squared distances to cluster centres. However, you should be able to compute it for all A/B splits in linear time, if you do enough algebra to work out the change in value when one point changes group. – mcdowella Jan 14 '12 at 13:01
  • @mcdowella: An image would be nice to clarify your idea. I still don't get it. – Micromega Jan 14 '12 at 13:24
  • I'm not good with pictures, but I've added more explanation to the answer. – mcdowella Jan 14 '12 at 16:42