11

I have fitted a k-means model to 5000+ samples using the Python scikit-learn library. I want to output the 50 samples closest to a cluster center. How can I do this?

Archie
Nipun Alahakoon

3 Answers

16

If km is the fitted k-means model, the distance from each point in an array X to the j-th centroid is

d = km.transform(X)[:, j]

This gives an array of len(X) distances. The indices of the 50 closest to centroid j are

ind = np.argsort(d)[::-1][:50]

so the 50 points closest to the centroids are

X[ind]

(or use np.argpartition, available since NumPy 1.8, which avoids a full sort and is a lot faster on large arrays).
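As a rough illustration of the argpartition route, here is a minimal NumPy-only sketch. The data `X`, the array `centers` (a stand-in for `km.cluster_centers_`), and the value `k` are assumptions made up for the example; with a fitted model, `d` would come from `km.transform(X)[:, j]` instead. Since `d` holds distances, the sketch takes the smallest values, i.e. ascending order with no reversal.

```python
import numpy as np

# Stand-in data and centroids (assumptions for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
centers = np.array([[0.0, 0.0], [3.0, 3.0]])  # plays the role of km.cluster_centers_
j = 0   # index of the centroid of interest
k = 50  # number of closest samples wanted

# Euclidean distance from every sample to centroid j.
# With a fitted KMeans model this would be: d = km.transform(X)[:, j]
d = np.linalg.norm(X - centers[j], axis=1)

# argpartition moves the k smallest distances to the front (in arbitrary order)
# without sorting the whole array.
part = np.argpartition(d, k)[:k]

# Sort only those k candidates so the result runs from nearest to farthest.
ind = part[np.argsort(d[part])]

closest = X[ind]  # the k samples nearest to centroid j
```

If the order of the 50 samples does not matter, `part` can be used directly and the second sort skipped.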

Fred Foo
  • 5
    Why the "-1" after your argsort? Since you want the shortest distance and argsort defaults to ascending, shouldn't you omit this? – mdubez May 02 '16 at 19:11
  • 4
    The "-1" in argsort is unnecessary and reverses the order as pointed out by @mdubez – optimist Jun 27 '17 at 07:19
11

One correction to @snarly's answer.

After d = km.transform(X)[:, j], d holds the distances to centroid j, not similarities.

So to get the indices of the 50 closest points, you should remove the '-1', i.e.,

ind = np.argsort(d)[:50]

(np.argsort returns the indices that sort d in ascending order, so the smallest distances come first.)

Also, if you do want the 50 largest values (as you would with similarities), a shorter way of writing

ind = np.argsort(d)[::-1][:50] is

ind = np.argsort(d)[:-51:-1].
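To make the difference between the slices concrete, here is a small sketch on a made-up five-element array (the values are an assumption chosen only for illustration). It takes 2 elements instead of 50, so the reversed slice becomes `[:-3:-1]`.

```python
import numpy as np

# Toy distances (made up for illustration).
d = np.array([0.9, 0.1, 0.5, 0.3, 0.7])

order = np.argsort(d)          # indices in ascending order of d: [1, 3, 2, 4, 0]

nearest = order[:2]            # two smallest distances: [1, 3]
farthest_a = order[::-1][:2]   # two largest distances: [0, 4]
farthest_b = order[:-3:-1]     # same result in a single slice: [0, 4]
```

For distances, `nearest` is the form that answers the question; the reversed slices only make sense when larger values mean "closer", as with similarity scores.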

JUNPA
0

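If you have the distance-to-center values in a list of (distance, point) tuples, you can sort the list. Tuples compare on their first element first, so the list ends up ordered by distance.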

results = [(distance_to_center, (x, y)), (distance_to_center, (x, y)), ...]
results.sort()
# get closest 50
closest_fifty = results[:50]
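A runnable version of the same idea might look like the sketch below. The `center` and `points` values are hypothetical, made up for the example, and `math.dist` requires Python 3.8 or later. `heapq.nsmallest` is shown as an alternative that avoids sorting the whole list; it is not part of the original answer.

```python
import heapq
import math

# Hypothetical cluster center and sample points (assumptions for illustration).
center = (0.0, 0.0)
points = [(float(i), float(i % 7)) for i in range(200)]

# Pair each point with its Euclidean distance to the center.
results = [(math.dist(p, center), p) for p in points]

# Sorting tuples orders them by distance first (ties broken by the point itself).
results.sort()
closest_fifty = results[:50]

# Alternative: pick the 50 smallest without relying on a fully sorted list.
closest_fifty_heap = heapq.nsmallest(50, results)
```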
monkut