I am trying to cluster ~30 million points (x and y co-ordinates). What makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring that the maximum distance between the cluster and any one point is not too large (no more than about 5km).
Each cluster is served by equipment that can handle 64 points. If a cluster contains fewer than 65 points, we need one piece of this equipment. However, if a cluster contains 65 points, we need two pieces, which leaves a spare capacity of 63 for that cluster. We also need to connect each point to the cluster, so the distance from each point to the cluster is also a factor in the equipment cost.
Ultimately I am trying to minimise the cost of equipment, which seems to be equivalent to minimising the average spare capacity whilst also ensuring the distance from the cluster to any one point is less than 5km (an approximation, but it will do for the thought experiment; maybe there are better ways to impose this restriction).
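To make the capacity arithmetic concrete, here is a minimal sketch of how equipment count and spare capacity follow from cluster size. The function names are mine, purely for illustration; the only fixed number is the 64 points per piece of equipment.

```python
import math

CAPACITY = 64  # points served by one piece of equipment


def equipment_needed(n_points: int, capacity: int = CAPACITY) -> int:
    """Number of pieces of equipment needed to serve one cluster."""
    return math.ceil(n_points / capacity)


def spare_capacity(n_points: int, capacity: int = CAPACITY) -> int:
    """Unused slots left over once every point in the cluster is served."""
    return equipment_needed(n_points, capacity) * capacity - n_points


for n in (64, 65, 128):
    print(n, equipment_needed(n), spare_capacity(n))
# 64 1 0
# 65 2 63
# 128 2 0
```

So a cluster is only "free" of waste when its size is an exact multiple of 64, which is why unconstrained clustering tends to leave a lot of spare capacity.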
I have tried multiple approaches:
- K-means
  - Most should know how this works
  - Average spare capacity of 32
  - Runs in O(n^2)
- Sorted list of a-b distances
  - I tried an alternative approach like so:
    - Initialise cluster points by randomly selecting points from the data
    - Determine the distance matrix between every point and every cluster
    - Flatten it into a list
    - Sort the list
    - Go from smallest to longest distance, assigning points to clusters
      - Assign points to a cluster until it reaches 64, then no more can be assigned to it
      - Stop iterating through the list once all points have been assigned
    - Update the cluster centroid based on the assigned points
    - Repeat the assignment and update steps until the cluster locations converge (as in K-means)
    - Collect cluster locations that are nearby into one cluster
  - This had an average spare capacity of approximately 0, by design
  - This worked well for my test data set, but as soon as I expanded to the full set (30 million points) it took far too long, probably because we have to sort the full list, O(N log N), then iterate over it until all points have been assigned, O(NK), and then repeat that until convergence
- Linear Programming
  - This was quite simple to implement using libraries, but it also took far too long, again because of the complexity
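For reference, the sorted-list approach above can be sketched roughly as follows. This is a simplified pure-Python illustration rather than my actual code: the names `capacitated_assign` and `update_centres` are made up for this example, and the final step of merging nearby clusters is left out.

```python
import math
import random


def capacitated_assign(points, centres, capacity=64):
    """One greedy pass: closest (point, centre) pairs first, capped per centre."""
    n, k = len(points), len(centres)
    if n > k * capacity:
        raise ValueError("not enough total capacity for all points")
    # Flattened distance matrix: one (distance, point index, centre index) per pair.
    pairs = [(math.dist(p, c), i, j)
             for i, p in enumerate(points)
             for j, c in enumerate(centres)]
    pairs.sort()
    labels = [-1] * n   # -1 means "not yet assigned"
    load = [0] * k      # number of points currently in each cluster
    remaining = n
    for _, i, j in pairs:
        if labels[i] == -1 and load[j] < capacity:
            labels[i] = j
            load[j] += 1
            remaining -= 1
            if remaining == 0:  # stop once every point has a cluster
                break
    return labels


def update_centres(points, labels, centres):
    """Move each centre to the mean of its assigned points (keep it if empty)."""
    new_centres = []
    for j, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centres.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
        else:
            new_centres.append(old)
    return new_centres


random.seed(0)
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(256)]
centres = random.sample(points, 4)
for _ in range(10):  # fixed iteration count instead of a convergence test
    labels = capacitated_assign(points, centres)
    centres = update_centres(points, labels, centres)
sizes = [labels.count(j) for j in range(4)]
print(sizes)  # [64, 64, 64, 64]: 256 points fill 4 clusters of 64 exactly
```

The `pairs` list holds one entry per (point, cluster) combination, which is what stops this scaling: with 30 million points and roughly one cluster per 64 points, that is on the order of 10^13 entries to build and sort on every iteration.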
I am open to any suggestions on possible algorithms/languages best suited to do this. I have experience with machine learning, but couldn't think of an obvious way of doing this using that.
Let me know if I missed any information out.