
I am trying to cluster ~30 million points (x and y coordinates) into clusters - the twist that makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring the maximum distance between the cluster and any one of its points is not huge (no more than ~5 km or so).

Each cluster is served by equipment that can handle 64 points. If a cluster contains 64 or fewer points then we need one piece of equipment; however, if a cluster contains 65 points then we need two pieces of equipment, which gives that cluster a spare capacity of 63. We also need to connect each point to the cluster, so the distance from each point to the cluster is also a factor in the equipment cost.

Ultimately I am trying to minimise the cost of equipment, which appears to be equivalent to minimising the average spare capacity while also ensuring the distance from the cluster to any one of its points is less than 5 km (an approximation, but it will do for the thought experiment - maybe there are better ways to impose this restriction).
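
To make the cost model concrete, here is a minimal sketch (the 64-point capacity comes from the question; the function names are just for illustration):

```python
import math

CAPACITY = 64  # points one piece of equipment can serve

def equipment_needed(cluster_size: int) -> int:
    """Number of pieces of equipment required for a cluster."""
    return math.ceil(cluster_size / CAPACITY)

def spare_capacity(cluster_size: int) -> int:
    """Unused capacity left over in a cluster."""
    return equipment_needed(cluster_size) * CAPACITY - cluster_size

# Examples from the question:
print(equipment_needed(64), spare_capacity(64))   # 1 unit, 0 spare
print(equipment_needed(65), spare_capacity(65))   # 2 units, 63 spare
```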

I have tried multiple approaches:

  • K-means
    • Most should know how this works
    • Average spare capacity of 32
    • Runs in roughly O(n^2) per iteration here, since the number of clusters k is proportional to n/64
  • Sorted list of a-b distances
    • I tried an alternative approach like so:
      1. Initialise cluster points by randomly selecting points from the data
      2. Determine the distance matrix between every point and every cluster
      3. Flatten it into a list
      4. Sort the list
      5. Go from smallest to largest distance, assigning points to clusters
      6. Assign points to each cluster until it reaches 64 points; after that, no more can be assigned to it
      7. Stop iterating through the list once all points have been assigned
      8. Update the cluster centroid based on the assigned points
      9. Repeat steps 2 - 8 until the cluster locations converge (as in K-means)
      10. Collect cluster locations that are nearby into one cluster
    • This had an average spare capacity of approximately 0, by design
    • This worked well for my test data set, but as soon as I expanded to the full set (30 million points) it took far too long, probably because we have to sort the full distance list (N*K entries, so O(NK log NK)), iterate over it until all points are assigned (O(NK)), and then repeat all of that until convergence (a rough sketch of this assignment step appears after this list)
  • Linear Programming
    • This was quite simple to implement using libraries, but it again took far too long because of the problem size
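
For reference, here is a rough sketch of the capacity-constrained assignment step from the second approach, assuming NumPy and a data set small enough that the full N x K distance matrix fits in memory (which is exactly what breaks down at 30 million points):

```python
import numpy as np

def assign_with_capacity(points, centroids, capacity=64):
    """Greedy assignment: take the shortest point-centroid distances first,
    closing a cluster once it holds `capacity` points.
    Assumes len(centroids) * capacity >= len(points)."""
    n, k = len(points), len(centroids)
    # Full (n x k) distance matrix -- fine for a test set, infeasible at 30M points.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(d, axis=None)          # flatten and sort all distances
    assignment = np.full(n, -1, dtype=int)
    counts = np.zeros(k, dtype=int)
    assigned = 0
    for flat in order:
        p, c = divmod(flat, k)
        if assignment[p] == -1 and counts[c] < capacity:
            assignment[p] = c
            counts[c] += 1
            assigned += 1
            if assigned == n:
                break
    return assignment

def update_centroids(points, assignment, centroids):
    """Recompute each centroid as the mean of its assigned points,
    keeping the old location for any empty cluster."""
    new = centroids.copy()
    for c in range(len(centroids)):
        members = points[assignment == c]
        if len(members):
            new[c] = members.mean(axis=0)
    return new
```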

I am open to any suggestions on possible algorithms/languages best suited to do this. I have experience with machine learning, but couldn't think of an obvious way to apply it here.

Let me know if I missed any information out.

Adam Dadvar

2 Answers


Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
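
As a sketch of what that per-super-cluster integer program could look like -- using PuLP purely for illustration; the candidate-site idea, the variable names, and the 5 km distance cutoff handling below are my own assumptions, not part of this answer:

```python
import numpy as np
import pulp

def solve_supercluster(points, candidate_sites, capacity=64, max_dist=5000.0):
    """Capacitated facility-location IP for one super-cluster: open as few
    sites (pieces of equipment) as possible, each serving at most `capacity`
    points, allowing only assignments within `max_dist`. Assumes every point
    has at least one candidate site within `max_dist` (otherwise infeasible)."""
    d = np.linalg.norm(points[:, None, :] - candidate_sites[None, :, :], axis=2)
    allowed = [(i, j) for i in range(len(points))
                      for j in range(len(candidate_sites))
                      if d[i, j] <= max_dist]

    prob = pulp.LpProblem("capacitated_clustering", pulp.LpMinimize)
    y = {j: pulp.LpVariable(f"open_{j}", cat="Binary")
         for j in range(len(candidate_sites))}
    x = {(i, j): pulp.LpVariable(f"assign_{i}_{j}", cat="Binary")
         for (i, j) in allowed}

    prob += pulp.lpSum(y.values())                 # minimise equipment count
    for i in range(len(points)):                   # every point assigned exactly once
        prob += pulp.lpSum(x[i, j] for j in range(len(candidate_sites))
                           if (i, j) in x) == 1
    for j in range(len(candidate_sites)):          # capacity, and only open sites serve
        prob += pulp.lpSum(x[i, j] for i in range(len(points))
                           if (i, j) in x) <= capacity * y[j]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: j for (i, j) in allowed if x[i, j].value() > 0.5}
```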

Old pre-question-edit answer below.


You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.

The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the size. Do this 6 times to get clusters of 64.

To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their (alternate between x- and y-) coordinate with P (as in kd-trees), breaking ties by comparing the other coordinate. Assign P to a half with an odd number of points if possible. (If both are even, let P be unmatched.) Recursively match the halves.
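
Here is a rough Python translation of that matching heuristic and the doubling loop (my own sketch of the description above; the helper names and the merge bookkeeping are assumptions):

```python
import random

def centroid(cluster):
    """Mean position of a list of (x, y) points."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def match_points(items, depth=0):
    """Approximate non-crossing matching of points in the plane, following
    the recursive kd-style split described above. `items` is a list of
    (x, y, payload) tuples; returns a list of matched pairs."""
    if len(items) <= 1:
        return []
    if len(items) == 2:
        return [(items[0], items[1])]
    axis = depth % 2                              # alternate x- and y-coordinate
    pivot = random.choice(items)
    rest = [p for p in items if p is not pivot]
    key = lambda p: (p[axis], p[1 - axis])        # break ties on the other coordinate
    left = [p for p in rest if key(p) < key(pivot)]
    right = [p for p in rest if key(p) >= key(pivot)]
    # Put the pivot into a half with an odd count so it gets matched if possible;
    # if both halves are even, the pivot stays unmatched at this level.
    if len(left) % 2 == 1:
        left.append(pivot)
    elif len(right) % 2 == 1:
        right.append(pivot)
    return match_points(left, depth + 1) + match_points(right, depth + 1)

def double_clusters(clusters):
    """One doubling round: match clusters by centroid and merge the pairs."""
    items = [(*centroid(c), c) for c in clusters]
    merged, used = [], set()
    for a, b in match_points(items):
        merged.append(a[2] + b[2])
        used.add(id(a[2]))
        used.add(id(b[2]))
    merged.extend(c for c in clusters if id(c) not in used)
    return merged

# Six rounds take 1-point clusters to clusters of (up to) 64:
# clusters = [[p] for p in points]
# for _ in range(6):
#     clusters = double_clusters(clusters)
```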

David Eisenstat
  • Interesting approach, I'm not entirely sure I understand the latter parts of your implementation - but is this not just agglomerative/hierarchical clustering, maybe with some added parts about odd and even values? – Adam Dadvar Feb 11 '19 at 12:58
  • @AdamDadvar Don't worry about it too much; that was the fastest algorithm I could think of that wouldn't construct a matching with edges that cross. I'll ponder your new constraint. – David Eisenstat Feb 11 '19 at 13:06
  • The data is already split into superclusters to some extent, each of size ~50k points, sorry I also forgot to mention that - will edit original post again. It's an interesting idea though, will look into integer programming with some of the cost data. – Adam Dadvar Feb 11 '19 at 13:54

Let p = ceil(N/64).

That is the optimal number of pieces of equipment.

Let s = ceil(sqrt(p)).

Sort the data by the x axis. Slice the data into slices of 64*s entries each (except possibly the last slice).

In each slice, sort the data by the y axis. Take 64 objects at a time and assign them to one piece of equipment. It's easy to see that all but possibly the last piece of equipment are used to full capacity, and the points assigned to each are close together.

Sorting is so incredibly cheap that this will be extremely fast. Give it a try, and you'll likely be surprised by the quality vs. runtime trade-off! I wouldn't be surprised if it finds results competitive with most of what you tried, except perhaps the LP approach, and it will run in just a few seconds.
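
A minimal NumPy sketch of that sort-slice-chunk strategy, assuming planar x/y coordinates (the variable names are mine):

```python
import numpy as np

def slice_cluster(points, capacity=64):
    """Sort by x, cut into slices of capacity*s points, sort each slice by y,
    then chunk into groups of `capacity`. Returns a cluster label per point."""
    n = len(points)
    p = -(-n // capacity)                 # ceil(n / capacity) pieces of equipment
    s = int(np.ceil(np.sqrt(p)))          # number of x-slices
    labels = np.empty(n, dtype=int)
    by_x = np.argsort(points[:, 0])       # indices sorted by x
    slice_size = capacity * s
    label = 0
    for start in range(0, n, slice_size):
        idx = by_x[start:start + slice_size]
        idx = idx[np.argsort(points[idx, 1])]   # sort this slice by y
        for c0 in range(0, len(idx), capacity):
            labels[idx[c0:c0 + capacity]] = label
            label += 1
    return labels
```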

Alternatively: sort all objects by their Hilbert curve coordinate. Partition into p partitions, assign one equipment each.

The second one is much harder to implement and likely slower. It can sometimes be better, but also sometimes worse.
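
If you want to try the Hilbert-curve variant, one way to get the sort key is the standard xy-to-Hilbert-index conversion; the grid resolution and the coordinate quantisation below are my own assumptions:

```python
import numpy as np

def hilbert_index(order, x, y):
    """Map integer grid coordinates (x, y) in [0, 2**order) to their position
    along a Hilbert curve (standard bit-twiddling conversion)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_partition(points, capacity=64, order=16):
    """Quantise coordinates to a 2**order grid, sort by Hilbert index, and cut
    the sorted order into consecutive groups of `capacity` (one per equipment)."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    grid = ((points - mins) / (maxs - mins + 1e-12) * ((1 << order) - 1)).astype(int)
    keys = np.array([hilbert_index(order, gx, gy) for gx, gy in grid])
    order_idx = np.argsort(keys)
    labels = np.empty(len(points), dtype=int)
    labels[order_idx] = np.arange(len(points)) // capacity
    return labels
```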

If distance is more important to you, try the following strategy: build a spatial index (e.g., a k-d-tree, or if you need Haversine, an R*-tree). For each point, find the 63 nearest neighbors and store this. Sort by distance, descending. This gives you a "difficulty" score. Now don't put equipment at the most difficult point, but nearby - at its neighbor with the smallest max(distance to the difficult point, distance to its own 63rd nearest neighbor). Repeat this for a few points, but after about 10% of the data, begin the entire procedure again with the remaining points. The problem is that you didn't clearly specify when to prefer keeping the distances small, even when that means using more equipment... You could incorporate this by only considering neighbors within a certain bound. The point with the fewest neighbors within the bound is then the hardest, and it is best covered by a neighbor with the most uncovered points within the bound, etc.
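
A sketch of just the "difficulty score" part using SciPy's cKDTree, assuming planar coordinates (the greedy placement loop and the 10% re-run schedule described above are left out; the names are mine):

```python
import numpy as np
from scipy.spatial import cKDTree

def difficulty_scores(points, capacity=64):
    """For each point, the distance to its (capacity-1)-th nearest neighbour.
    Points in sparse regions get large scores and are 'difficult' to cover."""
    tree = cKDTree(points)
    # query with k=capacity: index 0 is the point itself at distance 0,
    # so the last column is the distance to the 63rd other point.
    dists, _ = tree.query(points, k=capacity)
    return dists[:, -1]

# Process the hardest (sparsest) points first, as suggested above:
# order = np.argsort(-difficulty_scores(points))
```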

Has QUIT--Anony-Mousse
  • I was very skeptical about this approach - when I implemented it I found that the clustering was *okay* for dense locations, but in sparser regions of points the clustering was horrible. I suppose that's as expected, as we've segmented the data into vertical chunks first, so sparse points miss out on their closest neighbours very frequently with this approach. – Adam Dadvar Feb 11 '19 at 14:47
  • You can add a refinement step that allows "trading" points with neighbors. In particular if you allow some slack (i.e. make them only use 60 of 64 initially). But then capacity is no longer optimally used. – Has QUIT--Anony-Mousse Feb 12 '19 at 01:42
  • Btw, the first strategy, when implemented as intended, actually tries to cut the data into *squares*. If the input data is not very square, you may need to vary the s value to make more slices (wide input, e.g. all of the US) or fewer (tall input, say, California). But yes, if there are gaps, points can be assigned in undesired ways (say Hawaii and California being connected) – Has QUIT--Anony-Mousse Feb 12 '19 at 07:39
  • But the bottom strategy may be more what you intended, with a threshold on the distance. Otherwise, any isolated location will be covered by some other place, at a large distance. – Has QUIT--Anony-Mousse Feb 12 '19 at 08:11