
I want to create a dendrogram from an index (proportion data) so that individuals with similar index values are clustered together. I am trying to decide which distance/similarity metric to use so that the clusters reflect the original index values.

I have a data frame that looks like this:

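data <- read.table(text = "ind index
T1   0.10
T2   0.11
T3   0.01
T4   0.64
T5   0.03
T6   0.15
T7   0.26
T8   0.06
T9   0.01
T10  0.004
T11  0.01
T12  0.19
T13  0.04
T14  0.69
T15  0.06
T16  0.51
T17  0.15
T18  0.26
T19  0.26
T20  0.01
", header = TRUE)

head(data)

# one-column matrix of index values; row names become the leaf labels
data2 <- as.matrix(data[, 2])
rownames(data2) <- data$ind

# pairwise Euclidean distances between the index values
d <- dist(data2)

# prepare hierarchical cluster (hclust defaults to complete linkage)
hc <- hclust(d)
# very simple dendrogram
plot(hc)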

This will produce a simple dendrogram. However, I actually want to use the values from the index column as "my distance". Any suggestions are welcome. Thanks in advance!

user1626688
  • What is T1? The distance of obj 1 to obj 2? – Has QUIT--Anony-Mousse Feb 08 '15 at 13:18
  • No, T1, T2, etc. are unique individuals. I want to show a way of grouping by the index column, which is the proportion of time (0-1) each individual spends in a specific area. I am not sure what the correct way of grouping/clustering is for this kind of data. – user1626688 Feb 08 '15 at 13:27
  • On one-dimensional data, most distance functions do exactly the same thing... I do not understand your question. – Has QUIT--Anony-Mousse Feb 08 '15 at 13:29
  • Hi. I gave an answer, but on second thought I may not have understood the question. Do you want the dendrogram to represent the distances between the values in INDEX as well as possible? Or are you looking for a way to position the values of INDEX along the x axis and then draw a dendrogram on top of them? – Tal Galili Mar 05 '15 at 10:32
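
The comment above about one-dimensional data can be checked directly. The following is a minimal sketch, assuming the index values from the question are typed in as a plain numeric vector (the variable names are illustrative, not from the original post). It compares three of the metrics offered by `dist()` and shows that, with a single column, they all reduce to the absolute difference between two values.

```r
# Index values copied from the question's data frame
index <- c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
           0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)

# Three different metrics on the same one-dimensional data
d_euc <- dist(index, method = "euclidean")
d_man <- dist(index, method = "manhattan")
d_max <- dist(index, method = "maximum")

# Reference: plain absolute differences |x_i - x_j|, in dist() ordering
abs_diff <- as.vector(as.dist(abs(outer(index, index, "-"))))

# Each comparison should report TRUE (up to floating-point tolerance)
print(all.equal(as.vector(d_euc), abs_diff))
print(all.equal(as.vector(d_man), abs_diff))
print(all.equal(as.vector(d_max), abs_diff))
```

So for a single proportion column, the choice among these metrics does not change the dendrogram; the linkage method passed to `hclust()` is what makes a difference.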

2 Answers


Perhaps this will help? Your values are on the y-axis.

hc <- hclust(d = d, method="single", members=NULL)
library(ggdendro)
ggdendrogram(hc, theme_dendro=FALSE)
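
If the goal is to read each individual's index value off the plot, one option is to put the value into the leaf labels. This is only a base-R sketch that does not use ggdendro; the label format and the variable names `index` and `leaf_labels` are assumptions made for illustration.

```r
# Index values copied from the question's data frame
index <- c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
           0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)
names(index) <- paste0("T", seq_along(index))

# Same clustering as above: single linkage on Euclidean distances
hc <- hclust(dist(index), method = "single")

# Leaf labels of the form "T4 (0.640)" so the original value is visible
leaf_labels <- sprintf("%s (%s)", names(index), format(index))

# hang = -1 aligns all leaves at the bottom of the plot
plot(hc, labels = leaf_labels, hang = -1,
     main = "Single linkage on index", xlab = "", sub = "")
```

The heights at which branches merge are then distances between index values, while the labels carry the values themselves.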


lawyeR

You can use the cophenetic function to extract the cophenetic distance matrix implied by an hclust object. With that, you can check how well your dendrogram represents your original distances by computing the correlation between the original distance matrix and the cophenetic distances from the dendrogram. For example:

> hc <- hclust(d, method="single")
> cor(d, cophenetic(hc))
[1] 0.9270891
> hc <- hclust(d, method="complete")
> cor(d, cophenetic(hc))
[1] 0.9249611

This tells you that the "single" method is a tiny bit better than "complete", but that neither of the two fully captures the original distance matrix (since the correlation is not 1).
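
The same check can be run over several linkage methods at once. The sketch below is an illustration only: the set of methods compared is an arbitrary choice, and the index values are re-entered as a vector so the block runs on its own.

```r
# Index values copied from the question's data frame
index <- c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
           0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)
names(index) <- paste0("T", seq_along(index))

d <- dist(index)  # Euclidean distance; in one dimension this is |x_i - x_j|

# Linkage methods to compare (this particular selection is an assumption)
methods <- c("single", "complete", "average", "ward.D2")

# Cophenetic correlation for each method: closer to 1 means the dendrogram
# preserves the original pairwise distances more faithfully
coph_cor <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})

print(round(sort(coph_cor, decreasing = TRUE), 3))
```

Whichever method ends up at the top of the sorted output is the one whose dendrogram best matches the original distances for this data.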

I hope this helps.

Tal Galili