
I've managed to adapt a code snippet that shows how to use Pycluster's k-means clustering algorithm. I was hoping to be able to weight the data points, but unfortunately I can only weight the features. Am I missing something, or is there a trick I can use to make some of the points count more than others?

import numpy as np
import Pycluster as pc

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# Goal: one weight per row of `points` (6 in total).
# With transpose=0, `weight` is per feature (column) instead, so it needs 4 entries.
weights = np.asarray([1.0, 1.0, 1.0, 1.0])

clusterid, error, nfound = pc.kcluster(
    points, nclusters=2, transpose=0, npass=10, method='a', dist='e', weight=weights
)
centroids, _ = pc.clustercentroids(points, clusterid=clusterid)
print(centroids)
orange
  • possible duplicate of [Weighting k Means Clustering by number of observations](http://stackoverflow.com/questions/27017349/weighting-k-means-clustering-by-number-of-observations) – Prune Sep 22 '15 at 15:55

2 Answers


Nowadays you can pass the `sample_weight` argument to the `fit` method of scikit-learn's `KMeans`.
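As a minimal sketch of what that could look like with the data from the question: the weight values below are made up for illustration, and `n_init=10` and `random_state=0` are just assumptions chosen to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# One weight per row of `points` (illustrative values, not from the question).
sample_weights = np.asarray([1.0, 2.0, 1.0, 1.0, 1.0, 1.0])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(points, sample_weight=sample_weights)

print(km.labels_)           # cluster index for each row
print(km.cluster_centers_)  # one centroid per cluster
```

Here the second point pulls its cluster's centroid twice as strongly as an unweighted point would.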

scc
  • Do you have any suggestion how to determine/calculate optimum sample_weight to be assigned to each feature? – zlatko Jan 05 '21 at 22:39

Weighting individual data points is not part of the k-means algorithm as it is usually defined, and it is not available in Pycluster, MLlib, or TrustedAnalytics.

You can, however, add duplicate data points. For instance, if you want the second data point to count twice as much, change your array to read:

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])
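If the weights are small whole numbers, the duplication does not have to be typed by hand. A rough sketch using `np.repeat`, where `counts` is a hypothetical array of integer weights (one per row):

```python
import numpy as np

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# Hypothetical integer weights: the second point should count twice.
counts = np.asarray([1, 2, 1, 1, 1, 1])

# Repeat each row as many times as its count says.
expanded = np.repeat(points, counts, axis=0)
print(expanded.shape)
```

The `expanded` array can then be passed to the clustering call in place of `points`. This only works for integer weights, and the array grows with the sum of the counts.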
Prune
  • I'm not entirely sure how I solved this problem (it's been a while), but I think I multiplied the distance to the centroid by the weight, which worked fine. – orange Sep 22 '15 at 08:57
  • Are you trying to *use* the algorithm, or are you writing your own implementation? – Prune Sep 22 '15 at 22:17
  • If you're writing your own, then it's pretty simple, as you said: add a column of weights for the points. On each iteration, you multiply that by the distance to the centroid, which costs little extra time compared to the root-sum-square operation. Is there a question still open on this for you? – Prune Sep 23 '15 at 15:47
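
For anyone writing their own implementation, here is a rough NumPy sketch of one common way to weight points in Lloyd's algorithm. It is not necessarily what either commenter did: in this variant the weights enter the centroid update (a weighted mean) rather than the assignment step. The function name `weighted_kmeans` and its parameters are made up for illustration.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=100, seed=0):
    """Lloyd's algorithm with per-point weights (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the weighted mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            mask = labels == j
            if weights[mask].sum() > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])
weights = np.asarray([1.0, 2.0, 1.0, 1.0, 1.0, 1.0])  # illustrative values

labels, centroids = weighted_kmeans(points, weights, k=2)
print(labels)
print(centroids)
```

With integer weights this update gives the same centroids as duplicating rows, since the weighted mean of a cluster equals the plain mean of the duplicated members.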