
I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet:

import fastcluster
import scipy.cluster.hierarchy as sch
Y = fastcluster.linkage(D, method='centroid')  # D: distance matrix
Z1 = sch.dendrogram(Y, truncate_mode='level', p=7, show_contracted=True)

Since the dendrogram will become rather dense with all this data, I use the truncate_mode to prune it a bit. All of this works, but I wonder how I can find out which of the original 5000 entries belong to a particular branch in the dendrogram.

I tried using

 leaves = sch.leaves_list(Y)

to get a list of leaves, but this takes the linkage output as input rather than the pruned dendrogram. While I can see the correspondence between the pruned dendrogram and the leaves list, mapping the original entries to the dendrogram by hand is cumbersome.

To summarize: is there a way of listing all the original entries in the distance matrix that belong to a branch in a pruned dendrogram? Or are there other methods of doing this that I am not aware of?

Thanks

user1354607
  • maybe i'm not understanding, but couldn't you keep a copy from before pruning? – andrew cooke Apr 25 '12 at 02:17
  • I see what you mean. That could work, but will still require manual mapping of the entries, since the output after pruning is a dict with the number of members in each branch, and the output before pruning is a dict with each entry as they appear in the dendrogram. One then has to map these two together. – user1354607 Apr 25 '12 at 15:35
  • What about Z1['ivl']? According to the documentation, this is "a list of labels corresponding to the leaf nodes". You can supply custom labels as input to the dendrogram function, but by default they are just the indices of the original observations. – Dhara May 27 '12 at 17:03
  • Thank you all. In the end I got it to work using the Z1['ivl'] approach, and comparing the two dictionaries before and after pruning. – user1354607 Jun 04 '12 at 18:02

1 Answer


The dictionary returned by scipy.cluster.hierarchy.dendrogram has the key ivl, which the documentation describes as:

a list of labels corresponding to the leaf nodes

You can supply custom labels (using labels=<array of labels>) as input to the dendrogram function, but by default they are just the indices of the original observations. By comparing the original labels/indices with Z1['ivl'], you can determine what the original entries were.
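
The mapping can also be read off programmatically. What follows is a minimal sketch under stated assumptions, not the asker's exact code: it uses toy random data in place of the 5000 entries, scipy's own linkage with method='average' (fastcluster.linkage returns a linkage matrix in the same format), and the names X, nodes, and members are purely illustrative. It relies on the leaves key of the dendrogram output, which holds cluster ids: ids below n are original observations, while larger ids are contracted branches, which ivl labels with a member count in parentheses by default. sch.to_tree with rd=True returns a node list indexed by those ids, and each node's pre_order() lists the original observations beneath it.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Toy data standing in for the real entries (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
D = pdist(X)                           # condensed distance matrix
Y = sch.linkage(D, method='average')   # same output format as fastcluster.linkage

# no_plot=True computes the dendrogram layout without drawing it.
Z1 = sch.dendrogram(Y, truncate_mode='level', p=3, no_plot=True)

n = Y.shape[0] + 1                     # number of original observations
_, nodes = sch.to_tree(Y, rd=True)     # nodes[i] is the ClusterNode with id i

# For every leaf of the pruned dendrogram, collect the original observation
# indices that sit below it (a singleton leaf maps to just itself).
members = {node_id: sorted(nodes[node_id].pre_order())
           for node_id in Z1['leaves']}

for label, node_id in zip(Z1['ivl'], Z1['leaves']):
    print(label, '->', members[node_id])
```

Each key of members is a leaf of the pruned dendrogram, in the same left-to-right order as Z1['ivl'], and each value lists the rows of the original distance matrix that fall under that leaf.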

Dhara