13

I'm using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want this dendrogram to be truncated to some reasonable size, with the label for each leaf being a string of the most significant words in that cluster.

My problem is that, according to the docs, "The labels[i] value is the text to put under the ith leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean I can't label clusters, only singular points?

To illustrate, here is a short Python script that generates a simple labeled dendrogram:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')

labelList = ["foo" for i in range(0, 20)]

plt.figure(figsize=(15, 12))
dendrogram(
            linked,
            orientation='right',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )
plt.show()

[Figure: a dendrogram of randomly generated points]

Now let's say I want to truncate to just 5 leaves and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate it and supply a label list to match:

labelList = ["foo, foo, foo..." for i in range(0, 5)]
dendrogram(
            linked,
            orientation='right',
            p=5,
            truncate_mode='lastp',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )

And here's the problem: no labels.

[Figure: the truncated dendrogram, with no leaf labels]

I'm thinking there might be a use here for the parameter 'leaf_label_func' but I'm not sure how to use it.

EmmetOT
  • I'm a couple of years late to the party, but the reason there are no labels is that the labels parameter of dendrogram only works for singleton clusters. Non-singleton clusters require a more sophisticated approach – Yuca Jul 03 '18 at 17:04

3 Answers

9

You are correct about using the leaf_label_func parameter.

In addition to creating a plot, the dendrogram function returns a dictionary (called R in the docs) containing several lists. The leaf_label_func you create must take a value from R["leaves"] and return the desired label. The easiest way to set labels is to run dendrogram twice: once with no_plot=True to get the dictionary used to build your label map, and then again to create the plot.

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')

labels = ["A", "B", "C", "D"]
p = len(labels)

plt.figure(figsize=(8,4))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Look at my fancy labels!', fontsize=16)
plt.ylabel('distance', fontsize=16)

# call dendrogram to get the returned dictionary 
# (plotting parameters can be ignored at this point)
R = dendrogram(
                linked,
                truncate_mode='lastp',  # show only the last p merged clusters
                p=p,  # number of leaves to keep (one per label)
                no_plot=True,
                )

print("values passed to leaf_label_func\nleaves : ", R["leaves"])

# create a label dictionary
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{} - custom label!".format(temp[xx])

## This version gives you your label AND the count
# temp = {R["leaves"][ii]:(labels[ii], R["ivl"][ii]) for ii in range(len(R["leaves"]))}
# def llf(xx):
#     return "{} - {}".format(*temp[xx])


dendrogram(
            linked,
            truncate_mode='lastp',  # show only the last p merged clusters
            p=p,  # number of leaves to keep (one per label)
            leaf_label_func=llf,
            leaf_rotation=60.,
            leaf_font_size=12.,
            show_contracted=True,  # to get a distribution impression in truncated branches
            )
plt.show()
coradek
  • +1 for fellow Galvanize alum. However, do you know of a way to get the observations that make up the truncated leaves? E.g. I have 130k samples, I truncate at 100 clusters, and I want to know which observations reside in each cluster. – Grr May 01 '19 at 18:43
  • @Grr I use scipy.cluster.hierarchy.fcluster to retrieve the clusters. Jörn Hees has a good tutorial for this at https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/ – coradek May 31 '19 at 22:27
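As a hedged sketch of the follow-up in the comments (which observations sit under each truncated leaf), one option besides fcluster is scipy.cluster.hierarchy.to_tree, because it can be indexed directly by the cluster ids found in R["leaves"]. The names words, members, and llf below are hypothetical and used only for illustration, and the "first three words" rule stands in for whatever "most significant words" logic you already have.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(20, 3))
words = [f"word{i}" for i in range(len(points))]  # hypothetical per-row labels

Z = linkage(points, "ward")
p = 5  # number of leaves to keep after truncation

# First pass: no plot, just find out which cluster ids become leaves.
R = dendrogram(Z, truncate_mode="lastp", p=p, no_plot=True)

# to_tree(..., rd=True) also returns a list of nodes indexed by cluster id.
_, nodes = to_tree(Z, rd=True)

# pre_order() lists the original observation indices under a node.
members = {leaf: nodes[leaf].pre_order() for leaf in R["leaves"]}

def llf(cluster_id):
    """Label a leaf with up to three of the words it contains."""
    names = [words[i] for i in members[cluster_id]]
    suffix = "..." if len(names) > 3 else ""
    return ", ".join(names[:3]) + suffix

# Second pass: the labels end up in the returned "ivl" list.
R_labeled = dendrogram(
    Z, truncate_mode="lastp", p=p, leaf_label_func=llf, no_plot=True
)
for leaf, label in zip(R_labeled["leaves"], R_labeled["ivl"]):
    print(leaf, len(members[leaf]), label)
```

To draw the figure, drop no_plot=True from the second call and follow it with plt.show(), as in the answer above.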
4

If you are not truncating the dendrogram, you can simply write:

hierarchy.dendrogram(Z, labels=label_list)

Here is an example using a pandas DataFrame:

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

data = [[24, 16], [13, 4], [24, 11], [34, 18], [41, 6], [35, 13]]
frame = pd.DataFrame(
    np.array(data),
    columns=["Rape", "Murder"],
    index=["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"],
)

Z = hierarchy.linkage(frame, 'single')
plt.figure()
dn = hierarchy.dendrogram(Z, labels=frame.index.tolist())  # pass the index as a plain list
plt.show()
0

It seems to me that @coradek's answer has a small mistake, although it was very helpful.

I used his code (with df as a pandas DataFrame) with a correction:

plt.figure(figsize=(20,10))
labelList = df.apply(lambda x: f"{x['...']}",axis=1)
Z = linkage(df[["..."]])
R = dendrogram(Z,no_plot=True)
labelDict = {leaf: labelList[leaf] for leaf in R["leaves"]}
dendrogram(Z,leaf_label_func=lambda x:labelDict[x])
plt.show()

I made this change because the code presented above always gave me the same order of tick labels.
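A minimal sketch of the difference, assuming an untruncated dendrogram: in that case each value in R["leaves"] is the row index of an original observation, so a label should be looked up by that id and not by its position in the list. The toy points and names below are made up for illustration. In the truncated example of the earlier answer the labels were arbitrary and assigned in display order, so a positional mapping may have been intentional there.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy data (hypothetical): names[i] describes row i of points.
points = np.array(
    [[0.0, 0.0], [10.0, 10.0], [0.5, 0.0], [10.0, 11.0], [5.0, 5.0]]
)
names = ["a", "b", "c", "d", "e"]

Z = linkage(points, "single")
R = dendrogram(Z, no_plot=True)

# Look up by leaf id: each leaf gets the name of its own row.
by_id = dendrogram(
    Z, leaf_label_func=lambda i: names[i], no_plot=True
)["ivl"]

# Look up by position: always yields names in their original order,
# whatever order the leaves are drawn in.
by_position = [names[pos] for pos, _ in enumerate(R["leaves"])]

print("leaf order :", R["leaves"])
print("by id      :", by_id)
print("by position:", by_position)
```

Whenever the leaf order differs from 0, 1, 2, ..., the positional version attaches names to the wrong leaves, which matches the "same order of ticks" symptom described above.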

Dmitry