How to calculate the centroids in k-means++ by using distances?

Question

I am using the k-means++ clusterer from Apache Commons Math in an interactive genetic algorithm to reduce the number of individuals that are evaluated by the user.

Commons Math makes this very easy: the user only needs to implement the Clusterable interface, which has two methods:

double distanceFrom(T p), which is self-explanatory, and T centroidOf(Collection<T> p), which lets the user pick the centroid of a cluster.

For Euclidean points, the centroid is very easy to calculate. For chromosomes it is quite difficult, because it is not always clear what an "average" chromosome would mean.
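To make the Euclidean case concrete: the centroid there is simply the component-wise mean of the points. The following is a minimal sketch, assuming points are stored as `double[]` arrays of equal length; the class name `EuclideanCentroid` is made up for illustration and is not part of Commons Math.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class EuclideanCentroid {

    // Component-wise mean of a non-empty collection of equally sized vectors.
    static double[] centroidOf(Collection<double[]> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("cluster must not be empty");
        }
        int dim = points.iterator().next().length;
        double[] mean = new double[dim];
        for (double[] p : points) {
            for (int i = 0; i < dim; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            mean[i] /= points.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        List<double[]> cluster = Arrays.asList(
                new double[] {0.0, 0.0},
                new double[] {2.0, 0.0},
                new double[] {1.0, 3.0});
        System.out.println(Arrays.toString(centroidOf(cluster))); // prints [1.0, 1.0]
    }
}
```

Note that the result is usually not one of the input points, which is exactly what makes this approach hard to transfer to chromosomes.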

My question: Is there an efficient, generic way to pick the centroid that does not depend on the problem domain (e.g. by using only the distance)?

EDIT

OK, here is my code for the centroid calculation. The idea: the point with the lowest total distance to all other points is the one nearest to the centroid.

public T centroidOf(Collection<T> c) {
  double minDist = Double.MAX_VALUE;
  T minP = null;

  // iterate through c
  final Iterator<T> it = c.iterator();
  while (it.hasNext()) {
    // test every point p1
    final T p1 = it.next();
    double totalDist = 0d;
    for (final T p2 : c) {
      // sum up the distance to all points p2 | p2!=p1
      if (p2 != p1) {
        totalDist += p1.distanceFrom(p2);
      }
    }

    // if the current distance is lower than the min, take it as the new min
    if (totalDist < minDist) {
      minDist = totalDist;
      minP = p1;
    }
  }
  return minP;
}
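The method above calls distanceFrom for every ordered pair, so each pair is evaluated twice. If the distance is symmetric, each pair only needs to be evaluated once. Below is a rough sketch of that idea, not a drop-in replacement: it assumes a symmetric distance, takes a `List` instead of a `Collection` for index access, and uses a made-up Hamming distance on gene strings as a stand-in for a real chromosome distance. The names `SymmetricMedoid`, `medoidOf` and `hamming` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class SymmetricMedoid {

    // Picks the element with the lowest summed distance to all others,
    // evaluating each unordered pair once (assumes dist(a, b) == dist(b, a)).
    static <T> T medoidOf(List<T> points, ToDoubleBiFunction<T, T> dist) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("cluster must not be empty");
        }
        int n = points.size();
        double[] total = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double d = dist.applyAsDouble(points.get(i), points.get(j));
                total[i] += d;
                total[j] += d;
            }
        }
        int best = 0;
        for (int i = 1; i < n; i++) {
            if (total[i] < total[best]) {
                best = i;
            }
        }
        return points.get(best);
    }

    // Hypothetical chromosome distance: Hamming distance between equal-length gene strings.
    static double hamming(String a, String b) {
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        List<String> cluster = Arrays.asList("1100", "1000", "0001");
        System.out.println(medoidOf(cluster, SymmetricMedoid::hamming)); // prints 1000
    }
}
```

This halves the number of distance evaluations but is still quadratic in the cluster size, so it mainly helps when the distance function itself is expensive.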

cyborg · Accepted Answer · 2012-02-03T12:59:16.897


k-means requires an averaging metric (e.g., Euclidean). Without defining such a metric and space, you don't even know whether the average of points is actually a point inside the space.

You could, however, use k-medoids, which considers only the original points as candidates for medoids (while k-means finds means/centroids which are not necessarily on the original points). The algorithm looks for points which minimize pairwise dissimilarities (i.e., distanceFrom).
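As an illustration of the idea, here is a small sketch of a simple alternating k-medoids loop (assign each point to its nearest medoid, then re-pick the medoid of each cluster). It is not the PAM swap procedure and not what Commons Math does internally; the class `KMedoidsSketch`, its method names, and the caller-supplied initial medoids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class KMedoidsSketch {

    // Element of the cluster with the smallest summed distance to all members.
    static <T> T medoidOf(List<T> cluster, ToDoubleBiFunction<T, T> dist) {
        T best = null;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (T candidate : cluster) {
            double total = 0.0;
            for (T other : cluster) {
                total += dist.applyAsDouble(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }

    // Alternates assignment and medoid update until the medoids stop changing
    // or maxIter iterations have run. Returns the final medoids.
    static <T> List<T> cluster(List<T> points, List<T> initialMedoids,
                               ToDoubleBiFunction<T, T> dist, int maxIter) {
        List<T> medoids = new ArrayList<>(initialMedoids);
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: each point joins the cluster of its nearest medoid.
            List<List<T>> clusters = new ArrayList<>();
            for (int k = 0; k < medoids.size(); k++) {
                clusters.add(new ArrayList<>());
            }
            for (T p : points) {
                int nearest = 0;
                for (int k = 1; k < medoids.size(); k++) {
                    if (dist.applyAsDouble(p, medoids.get(k))
                            < dist.applyAsDouble(p, medoids.get(nearest))) {
                        nearest = k;
                    }
                }
                clusters.get(nearest).add(p);
            }
            // Update step: the new medoid is always one of the original points.
            List<T> updated = new ArrayList<>();
            for (int k = 0; k < medoids.size(); k++) {
                List<T> members = clusters.get(k);
                updated.add(members.isEmpty() ? medoids.get(k) : medoidOf(members, dist));
            }
            if (updated.equals(medoids)) {
                break;
            }
            medoids = updated;
        }
        return medoids;
    }

    public static void main(String[] args) {
        List<Double> points = Arrays.asList(1.0, 2.0, 3.0, 10.0, 11.0, 12.0);
        List<Double> medoids = cluster(points, Arrays.asList(1.0, 3.0),
                (a, b) -> Math.abs(a - b), 20);
        System.out.println(medoids); // prints [2.0, 11.0]
    }
}
```

Only the distance function is problem-specific here, so the same loop would work for chromosomes given a suitable dissimilarity measure.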

edited Feb 03 '12 at 12:59

answered Feb 01 '12 at 23:20


Thanks for the hint. I want to use a point of the population as centroid without creating new points. But I also want to use this implementation. The only question is how to implement the `centroidOf()` method? At the moment I am selecting a point of the collection randomly. – Stephan Feb 02 '12 at 01:01
There is an algorithm in the link. – cyborg Feb 02 '12 at 05:15
I accept the answer because of your link. The desired implementation is now shown in the original question. – Stephan Feb 03 '12 at 12:13
