Weighted clustering with pycluster

Question

I've managed to adapt a code snippet showing how to use Pycluster's k-means clustering algorithm. I was hoping to weight the data points, but unfortunately I can only weight the features. Am I missing something, or is there a trick I can use to make some points count more than others?

import numpy as np
import Pycluster as pc

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# would like to specify 6 weights, one for each row of `points`,
# but `weight` takes one value per feature (4 here)
weights = np.asarray([1.0, 1.0, 1.0, 1.0])

clusterid, error, nfound = pc.kcluster(
    points, nclusters=2, transpose=0, npass=10, method='a', dist='e', weight=weights
)
centroids, _ = pc.clustercentroids(points, clusterid=clusterid)
print(centroids)

possible duplicate of [Weighting k Means Clustering by number of observations](http://stackoverflow.com/questions/27017349/weighting-k-means-clustering-by-number-of-observations) — Prune, Sep 22 '15 at 15:55

score 1 · Answer 1 · answered Apr 04 '20 at 11:27

Nowadays you can pass `sample_weight` to the `fit` method of sklearn's `KMeans`.
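As a minimal sketch of what that could look like with the data from the question: the weight values, `n_init` and `random_state` below are made up for illustration, and the only assumption is a scikit-learn version whose `KMeans.fit` accepts `sample_weight`.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# One weight per row of `points` (illustrative values: the second
# point counts twice as much as the others).
sample_weights = np.asarray([1.0, 2.0, 1.0, 1.0, 1.0, 1.0])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points, sample_weight=sample_weights)

print(kmeans.labels_)           # cluster index for each point
print(kmeans.cluster_centers_)  # one centroid per cluster
```

Note that the weights here apply to samples (rows), not to features (columns), which is the opposite of what Pycluster's `weight` argument does.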

scc


Do you have any suggestions on how to determine or calculate the optimal sample_weight to assign to each feature? – zlatko Jan 05 '21 at 22:39

Prune · Answer 2 · 2015-09-22T22:16:50.910

0

Weighting individual data points is not part of the k-means algorithm as it is usually defined, and it isn't available in pycluster, MLlib, or TrustedAnalytics.

You can, however, add duplicate data points. For instance, if you want the second data point to count twice as much, change your list to read:

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])
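For more than a couple of points, the duplication can be generated rather than typed out. The sketch below uses `np.repeat`; the `counts` array is a hypothetical set of integer weights, and the approach only works when the weights can be expressed as whole-number repeat counts.

```python
import numpy as np

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])

# Hypothetical integer weights: how many times each row should appear.
counts = np.asarray([1, 2, 1, 1, 1, 1])

# Repeat each row of `points` according to its count.
weighted_points = np.repeat(points, counts, axis=0)

print(weighted_points.shape)  # (7, 4)
```

The resulting `weighted_points` can be passed to `pc.kcluster` in place of `points`. The returned cluster ids then have one entry per repeated row, so they need to be mapped back to the original points.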

edited Sep 22 '15 at 22:16

answered Sep 21 '15 at 22:43

Prune


I'm not entirely sure how I solved this problem (it's been a while), but I think I multiplied the distance to the centroid by the weight, which worked fine. – orange Sep 22 '15 at 08:57
Are you trying to *use* the algorithm, or are you writing your own implementation? – Prune Sep 22 '15 at 22:17
If you're writing your own, then it's pretty simple, as you said: add a column of weights for the points. On each iteration, you multiply that by the distance to the centroid, a relatively small time addition compared to the root-sum-square operation. Is there a question still open on this for you? – Prune Sep 23 '15 at 15:47
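For anyone writing their own implementation, here is a rough NumPy sketch of one common formulation, which may differ from what the commenters actually did. It assumes the goal is to minimise the sum of weight times squared distance to the assigned centroid. Under that assumption a point's weight scales its distance to every centroid equally, so the assignment step is unchanged and the weights only enter the centroid update as a weighted mean. The function name `weighted_kmeans`, its parameters and the weight values are all made up for illustration, and the weights are assumed to be positive.

```python
import numpy as np

def weighted_kmeans(points, weights, n_clusters, n_iter=100, seed=0):
    """Lloyd-style iterations with per-point weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct data points.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        diff = points[:, None, :] - centroids[None, :, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        # Update step: weighted mean of each cluster's points.
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            mask = labels == k
            if mask.any():  # keep the old centroid if a cluster is empty
                new_centroids[k] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.asarray([
    [1.0, 20, 30, 50],
    [1.2, 15, 34, 50],
    [1.6, 13, 20, 55],
    [0.1, 16, 40, 26],
    [0.3, 26, 30, 23],
    [1.4, 20, 28, 20],
])
weights = np.asarray([1.0, 2.0, 1.0, 1.0, 1.0, 1.0])  # one per point

labels, centroids = weighted_kmeans(points, weights, n_clusters=2)
print(labels)
print(centroids)
```

With integer weights this should give the same centroids as duplicating rows, since a weighted mean with weight 2 equals the plain mean with that row listed twice.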
