My R program is as below:

hcluster <- function(dmatrix) {
    imatrix <- NULL
    hc <- hclust(dist(dmatrix), method="average")
    for(h in sort(unique(hc$height))) {
        hc.index <- c(h,as.vector(cutree(hc,h=h)))
        imatrix <- cbind(imatrix, hc.index)
    }
    return(imatrix)
}

dmatrix_file = commandArgs(trailingOnly = TRUE)[1]
print(paste('Reading distance matrix from', dmatrix_file))
dmatrix <- as.matrix(read.csv(dmatrix_file,header=FALSE))

imatrix <- hcluster(dmatrix)
imatrix_file = paste("results",dmatrix_file,sep="-")
print(paste('Writing results to', imatrix_file))
write.table(imatrix, file=imatrix_file, sep=",", quote=FALSE, row.names=FALSE, col.names=FALSE)
print('done!')

My input is a distance matrix (symmetric, of course). When I run the program above on a distance matrix with more than a few thousand records (nothing goes wrong with several hundred), I get this error message:

Error in cutree(hc, h = h) : 
  the 'height' component of 'tree' is not sorted
(increasingly); consider applying as.hclust() first
Calls: hcluster -> as.vector -> cutree
Execution halted

My machine has about 16 GB of RAM and 4 CPUs, so resources shouldn't be the problem.

Can anyone please tell me what the problem is? Thanks!

Kevin
  • Naively implemented, hierarchical clustering has `O(n^3)` complexity (in fact, the known `O(n^2)` algorithms are only for some specialized versions, see `SLINK`, `CLINK`). It might indeed be an issue of complexity, although the error doesn't look like that. – Has QUIT--Anony-Mousse Feb 27 '12 at 06:54
  • I would like to go deeper. Could you post a sample of dmatrix_file and give directions on how to scale up? – Petr Matousu Dec 11 '12 at 13:28
  • Agree with Petr - can't you make dmatrix_file available, or a dummy dataset of the same dimensions? – geotheory Dec 13 '12 at 13:33

2 Answers

I'm not much of an R wizard - but I ran into exactly this problem.

A potential answer is described here:

https://stat.ethz.ch/pipermail/r-help/2008-May/163409.html

plof

Looking at the cutree function here http://code.ohloh.net/file?fid=QM4q0tWQlv2VywAoSr2MfgcNjnA&cid=ki3UJjFJ8jA&s=cutree%20component%20of%20is%20not%20sorted&mp=1&ml=1&me=1&md=1&browser=Default#L1

You may try passing the scalar k for the number of groups; this overrides the height argument. If that does not help, look at what hc$height is, because if it is not a numeric, complex, character or logical vector, is.unsorted will return true and give you this error.

if(is.null(k)) {
    if(is.unsorted(tree$height))
        stop("the 'height' component of 'tree' is not sorted (increasingly)")
    ## h |--> k
    ## S+6 help(cutree) says k(h) = k(h+), but does k(h-) [continuity]
    ## h < min() should give k = n;
    k <- n+1L - apply(outer(c(tree$height,Inf), h, ">"), 2, which.max)
    if(getOption("verbose")) message("cutree(): k(h) = ", k, domain = NA)
}
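
A minimal sketch of both suggestions, assuming a small random data set (`pts` is made up for illustration and is not the asker's data). It first inspects `hc$height`, then cuts the tree by number of groups `k`, which skips the sortedness check shown in the source above because that check only runs when `k` is `NULL`.

```r
# Hypothetical example data: 50 random points in 3 dimensions.
set.seed(1)
pts <- matrix(rnorm(150), ncol = 3)
hc <- hclust(dist(pts), method = "average")

# Diagnostic 1: what kind of object is the height component?
str(hc$height)

# Diagnostic 2: does cutree's check consider it unsorted,
# and how large is the biggest decrease between consecutive heights?
print(is.unsorted(hc$height))
print(min(diff(hc$height)))

# Workaround: cut by number of groups instead of by height.
# With a vector of k values, cutree returns a matrix with one column per k.
n <- length(hc$order)
groups_by_k <- cutree(hc, k = 1:n)
print(dim(groups_by_k))
```

If `min(diff(hc$height))` turns out to be a tiny negative number on the real data, that would suggest floating-point rounding in the merge heights rather than a genuinely malformed tree, but that is a guess that needs checking against the actual input.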
Chris Hinshaw