
I have N dots and I know all pairwise distances between them. I need to select K of them such that the average pairwise distance within the selection is maximal. The only idea I have is a dummy one: iterate through each dot.

Do you have a smarter idea for obtaining such a subset?

It would be nice to solve this problem in general, without any assumptions, but if it helps: N is around 10^3-10^4 and K is around 10^2.

My dummy idea: I start from dot #1 and search for the most distant dot, so I have a chunk of 2 dots. Next I search for a third dot which has the biggest average distance to these 2 dots, and so on until I have collected K dots. This procedure is repeated with each of the N dots as the starting value, so I obtain N sets of K dots, and from them I choose the one with the biggest average pairwise distance.
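
For reference, here is a minimal sketch of that dummy idea in Python with NumPy, assuming the pairwise distances are given as an N x N matrix D with a zero diagonal (the function names greedy_from and best_greedy_subset are just illustrative):

```python
import numpy as np

def greedy_from(D, start, K):
    """Grow a subset from `start`, always adding the dot with the largest
    average distance to the dots chosen so far."""
    chosen = [start]
    dist_sum = D[start].astype(float)          # sum of distances to chosen dots
    for _ in range(K - 1):
        dist_sum[chosen] = -np.inf             # never pick a dot twice
        nxt = int(np.argmax(dist_sum))         # max sum == max average here
        chosen.append(nxt)
        dist_sum = dist_sum + D[nxt]           # -inf entries stay -inf
    return chosen

def best_greedy_subset(D, K):
    """Run the greedy build-up from every starting dot and keep the best set."""
    best, best_avg = None, -np.inf
    for start in range(D.shape[0]):
        subset = greedy_from(D, start, K)
        sub = D[np.ix_(subset, subset)]
        avg = sub.sum() / (K * (K - 1))        # zero diagonal assumed
        if avg > best_avg:
            best, best_avg = subset, avg
    return best, best_avg
```

Restarting from every dot makes this roughly O(N^2 * K); with N around 10^4 and K around 10^2 that is on the order of 10^10 elementary operations, so in practice you may prefer to restart only from a random sample of starting dots.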

zlon
  • Could you explain your dummy idea a little further? – SaiBot Dec 13 '17 at 09:03
  • Just iterate over all dots, finding the maximal distance at each step. – zlon Dec 13 '17 at 09:14
  • So you mean you start with a random dot (point), iterate through all dots and search for the one furthest from the first dot, right? Then you pick this one and repeat until you have K dots? – SaiBot Dec 13 '17 at 09:18
  • I added an explanation to the question. – zlon Dec 13 '17 at 09:23
  • If you need the maximum or minimum of something, then you'll need to go through all of your data. There's no escaping that, because if you have the last single point very far from all the others, that's going to change the distance drastically. As an alternative, @SaiBot is right in suggesting clustering but as he said, that's an approximation – ChatterOne Dec 13 '17 at 12:08

2 Answers


I am not 100% sure, but this sounds like an NP-hard problem.

As an approximation you could perform K-Median Clustering and return the resulting cluster representatives as your result. Clustering basically tries to minimize distances of points belonging to the same cluster (which are not important for you) and maximize distances between points from different clusters (this is what you want).
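
A rough sketch of what that could look like on a precomputed distance matrix, using a simple k-medoids-style iteration (the function name k_medoids_representatives and the plain Voronoi-style update are my own illustration, not a reference K-Median implementation):

```python
import numpy as np

def k_medoids_representatives(D, K, n_iter=50, seed=0):
    """Small k-medoids-style clustering on a precomputed distance matrix D.
    Returns K medoid indices, which serve as the cluster representatives."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, size=K, replace=False)
    for _ in range(n_iter):
        # assign every point to its nearest medoid
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(K):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # new medoid = member minimizing total distance within its cluster
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```

Note that the medoids tend to sit near the centres of their clusters rather than on the outer boundary, so this is only a heuristic starting point for the dispersion problem (see also the edit below).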

Edit:

Thinking a little longer about this, I think that you would want to search for points on the outer limit of the dataset in order to maximize the average pairwise distance. Thus, you could either compute the (convex) hull and pick points from there, maybe recalculating the hull whenever you pick a point; or you could start with a point (a), search for the furthest point (b) from (a), then search for the furthest point (c) from (b), and so on (avoiding picking points twice, e.g. by removing them after you pick them). This will ensure that you pick points at the border of your dataset.
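
A minimal sketch of the second suggestion (the farthest-point chain), again assuming an N x N distance matrix D; note the caveat about this chain raised in the comments below:

```python
import numpy as np

def farthest_point_chain(D, K, start=0):
    """Pick K points by repeatedly jumping to the point farthest from the
    most recently picked one, never picking a point twice."""
    chosen = [start]
    current = start
    available = np.ones(D.shape[0], dtype=bool)
    available[start] = False
    for _ in range(K - 1):
        dists = np.where(available, D[current], -np.inf)
        current = int(np.argmax(dists))
        chosen.append(current)
        available[current] = False
    return chosen
```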

SaiBot
  • I don't think the second part of your edit actually works. The point farthest from (b) could be a point that's really close to (a), while another one which is maybe a bit closer to (b) but much farther away from (a) would yield a higher average distance. – ChatterOne Dec 13 '17 at 14:06
  • I agree, the dummy algorithm proposed by OP is probably better – SaiBot Dec 13 '17 at 14:07

This could be regarded as a special case of the heaviest k-subgraph problem on the complete graph with N vertices, where the weights satisfy a triangle inequality.

The general case of the problem is NP-hard, and I am guessing that the restrictions above are not sufficient to make it polynomial, though they certainly could admit some heuristics.

For an approximate solution, have you evaluated a greedy solution, or a greedy-up-to-random sampling style solution?

Addendum

I recently came across a paper discussing the maximal dispersion and heaviest subgraph problem in the case where the weights satisfy the triangle inequality:

Hassin et al., Approximation algorithms for maximal dispersion, Operations Research Letters 21 (1997), no. 3, 133–137, DOI 10.1016/S0167-6377(97)00034-5.

Section 3 of this paper provides a simple approximate O(n²) greedy algorithm for the case which corresponds to this question, and the authors prove that the result is at least 1/2 of the maximum.
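
The paper should be consulted for the exact procedure and the proof, but a well-known greedy heuristic for max-average dispersion under the triangle inequality, which I believe is the spirit of that section, repeatedly adds the farthest still-unchosen pair of points. A hypothetical sketch, assuming an even K and a distance matrix D:

```python
import numpy as np

def farthest_pair_greedy(D, K):
    """Repeatedly add the farthest pair among the still-unchosen points.
    Assumes K is even; for odd K one extra point could be added greedily."""
    available = np.ones(D.shape[0], dtype=bool)
    chosen = []
    for _ in range(K // 2):
        idx = np.flatnonzero(available)
        sub = D[np.ix_(idx, idx)].astype(float)
        np.fill_diagonal(sub, -np.inf)            # ignore zero self-distances
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        u, v = int(idx[i]), int(idx[j])
        chosen.extend([u, v])
        available[u] = available[v] = False
    return chosen
```

This naive version costs O(K * N²) because it rescans the remaining submatrix each round; the paper reports an O(n²) bound for its algorithm, so treat this only as an illustration of the greedy idea rather than a reproduction of it.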

halfflat
  • I should read the wiki. In your response I understand only the articles :(. – zlon Dec 13 '17 at 14:43
  • Sorry! By the triangle inequality, I'm just referring to how these weights come from the distances between the vertices, and so if u, v, and w are three points, and d is the distance function, d(u,w) <= d(u,v) + d(v,w). You could potentially use this when searching for a better solution given a starting solution: replacing point u with point v could increase the sum of the distances by at most d(u,v)*(k-1), for example. – halfflat Dec 13 '17 at 15:38
  • By greedy-up-to-random sampling, I mean: build up the set of points one by one by picking the point that increases the sum of distances the most, but as this is expensive to compute, try picking points from a randomly chosen subset, or find the point that increases the sum of distances to a randomly chosen subset of the points you already have in hand. – halfflat Dec 13 '17 at 15:40
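
To make that last suggestion concrete, here is a rough sketch of one way to implement the greedy-up-to-random-sampling idea from the comment above; the candidate sample size n_candidates and the function name sampled_greedy are illustrative choices, not part of the original suggestion:

```python
import numpy as np

def sampled_greedy(D, K, n_candidates=200, seed=0):
    """Greedy build-up, but at each step only a random sample of the
    remaining points is considered as the next candidate."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    chosen = [int(rng.integers(N))]
    dist_sum = D[chosen[0]].astype(float)      # distances to chosen points
    remaining = np.ones(N, dtype=bool)
    remaining[chosen[0]] = False
    for _ in range(K - 1):
        pool = np.flatnonzero(remaining)
        if pool.size > n_candidates:
            pool = rng.choice(pool, size=n_candidates, replace=False)
        # among the sampled candidates, pick the one adding the most distance
        best = int(pool[np.argmax(dist_sum[pool])])
        chosen.append(best)
        remaining[best] = False
        dist_sum += D[best]
    return chosen
```

The other variant mentioned in the comment, scoring each candidate against only a random subset of the points already chosen, trades accuracy for speed in the same spirit.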