I am working on a dataset that has 20,000 variables. The variables are all measured in the same unit, but since there are so many of them, I decided to cluster the variables to obtain groups of related variables.

I decided that a good option was applying hierarchical clustering, and I used the following code (assume D is the data frame):

d <- dist(D, method = "euclidean") 
clust1 <- hclust(d, method="ward.D") 
plot(clust1)
groups <- cutree(fit, k=150) 

The dendrogram I obtained is the following: [image: dendrogram plot]

As you can see, the variable names make it very hard to see anything useful here, but I don't know how to stop R from displaying variable names on the dendrogram.

I also have another question: I used the function "cutree" to build the groups, but as I discovered, this function seems to have a limitation and can only build up to 150 groups. Is there any other way to build the groups without this limitation?

Thank you very much

PS: Any other suggestions about how to group this crazy dataset are welcome.

  • Explore `ape::plot.phylo()` functionality to display your dendrogram without labels. Some options are [here](http://stackoverflow.com/questions/37563747/equally-spaced-out-lengths-in-dendrograms/37565014#37565014). – nya Jun 02 '16 at 20:25

1 Answer

Do you mean suppressing case labels rather than variable labels? If so, use as.dendrogram with the leaflab argument

plot(as.dendrogram(clust1),leaflab='none')
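
As a rough, self-contained illustration of the line above, here is a sketch on synthetic data. The names `D_demo` and `hc_demo` and the data sizes are made up for the example and are not from the question. It also shows `labels = FALSE`, which `plot()` accepts for `hclust` objects as another way to drop the leaf labels.

```r
# Synthetic stand-in for the real data: 200 rows, 5 columns (assumed sizes)
set.seed(1)
D_demo <- as.data.frame(matrix(rnorm(200 * 5), nrow = 200))

# dist() measures distances between rows, so the leaves are the rows of D_demo
hc_demo <- hclust(dist(D_demo, method = "euclidean"), method = "ward.D")

# Option 1: plot the hclust object directly, without leaf labels
plot(hc_demo, labels = FALSE)

# Option 2: convert to a dendrogram and suppress the leaf labels
plot(as.dendrogram(hc_demo), leaflab = "none")
```

Since `dist()` works on rows, clustering the columns instead would presumably mean passing `t(D_demo)` to `dist()`.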

I don't think there is a limit for k in cutree. You may want to try the package flashClust, which works better with large datasets for hierarchical clustering.
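
A quick way to check the point about `k` is to cut a tree into more than 150 groups on synthetic data. This is only a sketch: `X_demo`, `hc_big` and `groups_demo` are hypothetical names, and the sizes are arbitrary.

```r
# Synthetic data: 500 observations, 4 columns (assumed sizes)
set.seed(42)
X_demo <- matrix(rnorm(500 * 4), nrow = 500)

# Same clustering setup as in the question, on the synthetic data
hc_big <- hclust(dist(X_demo), method = "ward.D")

# Ask for more than 150 groups; k may be any integer from 1 to the number of leaves
groups_demo <- cutree(hc_big, k = 300)

# Count the distinct group labels that were assigned
length(unique(groups_demo))
```

If this runs without error on your machine, any limit you hit is more likely to come from the call itself than from `cutree`.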

jkt
  • I just noticed that you use `groups <- cutree(fit, k=150)`, but you actually call your cluster object `clust1`. Might this be the source of the error? – jkt Jun 03 '16 at 10:33