I have fitted a k-means model on 5000+ samples using the Python scikit-learn library. I want to get the 50 samples closest to a cluster center as output. How do I do this?
3 Answers
If km is the k-means model, the distance to the j'th centroid for each point in an array X is
d = km.transform(X)[:, j]
This gives an array of len(X) distances. The indices of the 50 closest to centroid j are
ind = np.argsort(d)[::-1][:50]
so the 50 points closest to centroid j are
X[ind]
(or use np.argpartition if you have a recent enough NumPy, because that's a lot faster).
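
As a rough illustration of the argpartition route, here is a minimal sketch on synthetic data. The data, the number of clusters, and the names `k` and `closest` are assumptions made for the example, not part of the original question. It uses ascending order, so the smallest distances are selected.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data, purely for illustration (assumption: 5000 samples, 2 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

j = 0                      # index of the centroid of interest
d = km.transform(X)[:, j]  # distance of every sample to centroid j

k = 50
# argpartition puts the indices of the k smallest distances in the first k
# slots, in no particular order, without fully sorting the array
ind = np.argpartition(d, k - 1)[:k]
# optional: order those k indices from nearest to farthest
ind = ind[np.argsort(d[ind])]

closest = X[ind]           # the k samples nearest to centroid j
```

The final argsort only touches k elements, so it can be dropped when the order of the 50 samples does not matter.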

- Why the "-1" after your argsort? Since you want the shortest distance and argsort defaults to ascending, shouldn't you omit this? – mdubez May 02 '16 at 19:11
- The "-1" in argsort is unnecessary and reverses the order, as pointed out by @mdubez – optimist Jun 27 '17 at 07:19
One correction to @snarly's answer.
After performing d = km.transform(X)[:, j], d holds distances to centroid j, not similarities.
So to get the indices of the 50 closest points, you should remove the '-1', i.e.,
ind = np.argsort(d)[:50]
(np.argsort returns indices in ascending order of distance, so the nearest points come first.)
Also, a shorter way of writing
ind = np.argsort(d)[::-1][:50]
(which selects the 50 farthest points) could be
ind = np.argsort(d)[:-51:-1]
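
To make the difference between the slices concrete, here is a small sketch on a toy distance array. The values and the variable names are made up for illustration.

```python
import numpy as np

# Toy distances to one centroid (hypothetical values)
d = np.array([0.9, 0.1, 0.5, 0.3, 0.7])

k = 2
nearest = np.argsort(d)[:k]                 # indices of the k smallest distances
farthest = np.argsort(d)[::-1][:k]          # indices of the k largest distances
farthest_short = np.argsort(d)[:-k - 1:-1]  # same as above, in a single slice

print(nearest)         # [1 3]
print(farthest)        # [0 4]
print(farthest_short)  # [0 4]
```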

If you have the distance-to-center values in a list of (distance, point) tuples, you can sort the list.
results = [(distance_to_center, (x, y)), (distance_to_center, (x, y)), ...]
results.sort()
# get closest 50
closest_fifty = results[:50]
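
A runnable version of the same idea might look like the sketch below. The points, the center, and the cutoff of two are hypothetical stand-ins for real data and for the 50 in the question. It assumes Python 3.8+ for math.dist.

```python
import math

# Hypothetical 2-D points and a hypothetical cluster center
points = [(1.0, 2.0), (0.5, 0.5), (3.0, 4.0), (0.1, 0.2)]
center = (0.0, 0.0)

# Pair each point with its Euclidean distance to the center
results = [(math.dist(p, center), p) for p in points]

# Sort by distance only, so ties never fall back to comparing the points
results.sort(key=lambda pair: pair[0])

# Keep the closest two (use 50 for the case in the question)
closest_two = results[:2]
```

When only a few of many points are needed, heapq.nsmallest from the standard library avoids sorting the whole list.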
