2

I would like to do hierarchical clustering by row and then by column. I came up with this total hack of a solution:

#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]

mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)

mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]

write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)

The variables mtxf.in and mtxf.out represent the input matrix and clustered output matrix files, respectively. The variable clusterMethod is one of the hclust methods, such as single, average, etc.

As an example input, here's a data matrix:

topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
row1    0       3       0       0       0       3
row2    6       6       6       6       6       6
row3    0       3       0       0       0       3
row4    6       6       6       6       6       6
row5    0       3       0       0       0       3
row6    0       3       0       0       0       3

Running this script, I lose my top-left corner element from mtxf.in. Here's the output that comes out of this script:

col5    col4    col1    col3    col2    col6
row6    0       0       0       0       3       3
row5    0       0       0       0       3       3
row1    0       0       0       0       3       3
row3    0       0       0       0       3       3
row2    6       6       6       6       6       6
row4    6       6       6       6       6       6

My questions: In addition to looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes or whether there are faster and cleaner, more "R"-like ways for doing this.

Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.

Andrie
  • 176,377
  • 47
  • 447
  • 496
Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345
  • There is actually a site for code review specifically now, might be worth trying there aswell/instead. http://codereview.stackexchange.com/ – Jess Oct 05 '11 at 02:32
  • 3
    I'd actually say you're more likely to get R-specific help here. – joran Oct 05 '11 at 02:55

2 Answers2

5

Once you have your data cleaned (i.e. removed the first column), this really just requires three lines of code:

Clean data (assign row names from first column, then remove first column):

dat <- mtfx.in
rownames(dat) <- dat[, 1]
dat <- dat[, -1]

Cluster and reorder:

row.order <- hclust(dist(dat))$order
col.order <- hclust(dist(t(dat)))$order

dat[row.order, col.order]

Results:

     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6
Andrie
  • 176,377
  • 47
  • 447
  • 496
0

I'll be honest I'm not totally clear on why you're doing some of the stuff your're doing, so it's entirely possible I've misunderstood what you're looking for. If I'm way off base, let me know and I'll delete this answer.

But I suspect that your life will be much easier (and your results actually correct) if you read your data in using row.names = 1 to indicate that the first column are actually row names. For example:

#Read the data in
d1 <- read.table(textConnection("topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
 row1    0       3       0       0       0       3
 row2    6       6       6       6       6       6
 row3    0       3       0       0       0       3
 row4    6       6       6       6       6       6
 row5    0       3       0       0       0       3
 row6    0       3       0       0       0       3"),
   sep = "",as.is = TRUE,header = TRUE,
   stringsAsFactors = TRUE,row.names = 1)

#So d1 looks like this: 
d1
     col1 col2 col3 col4 col5 col6
row1    0    3    0    0    0    3
row2    6    6    6    6    6    6
row3    0    3    0    0    0    3
row4    6    6    6    6    6    6
row5    0    3    0    0    0    3
row6    0    3    0    0    0    3

#Simple clustering based on rows 
clus1 <- hclust(dist(d1))
d2 <- d1[clus1$order,]
d2
     col1 col2 col3 col4 col5 col6
row6    0    3    0    0    0    3
row5    0    3    0    0    0    3
row1    0    3    0    0    0    3
row3    0    3    0    0    0    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6

#Now cluster on columns and display the result 
clus2 <- hclust(dist(t(d2)))
t(t(d2)[clus2$order,])
     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6

Since you tagged this code-review I guess I'll also point out that stylistically, many R folks prefer not to use T and F for booleans since they can be masked, while TRUE and FALSE cannot.

joran
  • 169,992
  • 32
  • 429
  • 468
  • I'm confused by your answer. Are you saying that the clustering my code makes is wrong? You seem to be getting the same end result that I get. Ultimately, my goal is to preserve the data in the input (such as the top-left corner label). – Alex Reynolds Oct 05 '11 at 08:14
  • @AlexReynolds Sorry; all I meant was that your current code is actually including the column of row names in the first round of clustering. That means that factor is being coerced to NAs by `dist`. It works out for your example, but I wouldn't rely on it in general. Notice Andrie's, like mine, involves making that column the row names. His method explicitly saves that info you want, though, so I'd follow his lead. – joran Oct 05 '11 at 13:33