
I'm doing kmeans clustering in R with two requirements:

  1. I need to specify my own distance function; currently it is based on the Pearson correlation coefficient.

  2. I want the clustering to use the average of the group members as the centroid, rather than an actual member. My reasoning is that an average makes more sense as a centroid because an actual member is rarely located at the true center of its group. Please correct me if I'm wrong about this.

First I tried the kmeans function in the stats package, but it doesn't allow a custom distance function.

Then I found the pam function in the cluster package. It does allow a custom distance metric by taking a dist object as a parameter, but it seems to me that it then uses actual members as centroids (medoids), which is not what I want: with only a distance matrix, I don't think it can compute distances to an averaged centroid.

So is there an easy way in R to do k-means clustering that satisfies both of my requirements?

Amro
Derrick Zhang
  • You can use `vegan::designdist` to create your own index (also see `vegan::vegdist` if it's already there). After you have your `dist` object, you can use `hclust` in the stats package with your preferred method of aggregation. – Roman Luštrik Sep 23 '11 at 05:35
  • @RomanLuštrik, thanks for commenting. I know how to specify a distance metric with hclust, but now I need to know how to do it with kmeans. – Derrick Zhang Sep 23 '11 at 08:59

1 Answer

Check the flexclust package:

The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation.
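To make the idea of k-centroids clustering with a custom distance and mean centroids concrete, here is a minimal base-R sketch. It is not flexclust's implementation; the names `cor_dist` and `kcentroids_cor` are hypothetical, the toy data is made up, and the loop is a plain Lloyd-style iteration that assumes no centroid ever becomes constant (which would make the correlation undefined).

```r
# Distance between each row of x and each row of centers:
# 1 - Pearson correlation. Returns an nrow(x) by nrow(centers) matrix.
cor_dist <- function(x, centers) {
  1 - cor(t(x), t(centers))
}

# Hypothetical helper: Lloyd-style loop that uses cluster means as centroids.
kcentroids_cor <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # start from k random rows
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    d <- cor_dist(x, centers)
    # index of the nearest centroid for every row (smallest distance)
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: stop
    cluster <- new_cluster
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      # keep the previous centroid if a cluster ends up empty
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Toy data: 20 increasing and 20 decreasing profiles over 6 points, plus noise
up   <- t(replicate(20, 1:6 + rnorm(6, sd = 0.3)))
down <- t(replicate(20, 6:1 + rnorm(6, sd = 0.3)))
x <- rbind(up, down)

fit <- kcentroids_cor(x, k = 2)
table(fit$cluster)
```

The centroids here are column means of the assigned rows, so they are generally not actual members of the data, which is the behaviour asked for in the question.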

The package also includes a function distCor:

R> flexclust::distCor
function (x, centers) 
{
    z <- matrix(0, nrow(x), ncol = nrow(centers))
    for (k in 1:nrow(centers)) {
        z[, k] <- 1 - .Internal(cor(t(x), centers[k, ], 1, 0))
    }
    z
}
<environment: namespace:flexclust>
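The source above shows the shape a distance function is expected to have: it takes a data matrix `x` (one observation per row) and a matrix `centers` (one centroid per row) and returns an `nrow(x)` by `nrow(centers)` matrix. The base-R sketch below illustrates why a plain `1 - cor(x, y)` does not fit that shape; `bad` and `row_cor_dist` are hypothetical names and the data is random.

```r
set.seed(42)
x <- matrix(rnorm(50), nrow = 10)     # 10 observations, 5 variables
centers <- x[1:3, , drop = FALSE]     # 3 centroids, here just the first 3 rows

# cor() correlates columns, so it needs x and y to have the same number of rows.
# With 10 observations and 3 centroids the call fails.
bad <- function(x, y) 1 - cor(x, y)
res_bad <- tryCatch(bad(x, centers), error = function(e) conditionMessage(e))
print(res_bad)

# Transposing both arguments correlates rows of x with rows of centers instead.
row_cor_dist <- function(x, centers) 1 - cor(t(x), t(centers))
d <- row_cor_dist(x, centers)
dim(d)   # 10 3: one row per observation, one column per centroid
```

A function with this `(x, centers)` signature should be usable in the same place as `distCor`, although that is not tested here.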
rcs
  • Thanks, rcs! Why do I get an "incompatible dimension" error when I specify dist as `family=kccaFamily(dist=function(x, y) { 1 - cor(x,y) })`? – Derrick Zhang Sep 24 '11 at 01:46
  • You need a function with arguments `x` and `centers`. See for instance the source code of `flexclust::distCor` – rcs Sep 25 '11 at 15:28
  • In case anyone is confused about how to use distCor, try: `res = kcca(data, 10, family=kccaFamily(dist=distCor))` – Dolan Antenucci Feb 06 '13 at 20:16
  • As an R rookie, it also took me a while to figure out how to see what attributes `res` had (use `attributes(res)` to list them, and `attr(res, 'second')` to access one). – Dolan Antenucci Feb 07 '13 at 15:12