I would like to use matplotlib to draw a dendrogram without using scipy. A similar question has been posted here; however, the marked solution suggests using scipy, and the links in the other answers that suggest ETE do not work. Using this example, I have verified the accuracy of my own method (i.e., not the scipy method) for agglomerative hierarchical clustering with the single-linkage criterion.

Using the same example linked from above, I have the necessary parameters to create my own dendrogram. The original distance_matrix is given by:

 .. DISTANCE MATRIX (SHAPE=(6, 6)):
[[  0 662 877 255 412 996]
 [662   0 295 468 268 400]
 [877 295   0 754 564   0]
 [255 468 754   0 219 869]
 [412 268 564 219   0 669]
 [996 400   0 869 669   0]]

A masked array of distance_matrix is used such that the diagonal entries from above are not counted as minimums. The mask of the original distance_matrix is given by:

 .. MASKED (BEFORE) DISTANCE MATRIX (SHAPE=(6, 6)):
[[-- 662 877 255 412 996]
 [662 -- 295 468 268 400]
 [877 295 -- 754 564 0]
 [255 468 754 -- 219 869]
 [412 268 564 219 -- 669]
 [996 400 0 869 669 --]]

distance_matrix is changed in-place at every iteration of the algorithm. Once the algorithm has completed, distance_matrix is given by:

 .. MASKED (AFTER) DISTANCE MATRIX (SHAPE=(1, 1)):
[[--]]

The levels (the minimum distance at each merger) are given by:

 .. 5 LEVELS:
[138, 219, 255, 268, 295]

We can also view the indices of the merged datapoints at every iteration. These indices refer to positions in the original distance_matrix, because reducing the matrix dimensions after each merger shifts the index positions. These indices are given by:

 .. 5x2 LOCATIONS:
[(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
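
For reference, the merging procedure described above can be sketched as follows. This is a minimal sketch, not the code used in the question: it fills retired entries with `np.inf` instead of using a masked array that shrinks at each step, the function name `single_linkage` is hypothetical, and the MI/TO entry is assumed to be 138 (as implied by the first level) rather than the 0 printed in the matrix above.

```python
import numpy as np

def single_linkage(distance_matrix):
    """Return (levels, locations) for single-linkage agglomerative clustering."""
    d = np.array(distance_matrix, dtype=float)
    np.fill_diagonal(d, np.inf)           # a point is never merged with itself
    levels, locations = [], []
    for _ in range(d.shape[0] - 1):
        i, j = np.unravel_index(np.argmin(d), d.shape)
        i, j = sorted((int(i), int(j)))   # the smaller index represents the cluster
        levels.append(float(d[i, j]))
        locations.append((i, j))
        merged = np.minimum(d[i], d[j])   # single linkage: the nearest member wins
        d[i, :] = merged
        d[:, i] = merged
        d[i, i] = np.inf
        d[j, :] = np.inf                  # retire the absorbed index
        d[:, j] = np.inf
    return levels, locations

# Assumption: entry (2, 5) is 138, not the 0 shown in the printed matrix.
distance_matrix = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]
levels, locations = single_linkage(distance_matrix)
print(levels)     # [138.0, 219.0, 255.0, 268.0, 295.0]
print(locations)  # [(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
```

Because the matrix keeps its original shape, the reported indices always refer to the original rows and columns, so no index bookkeeping is needed after a merger.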

From these indices, the ordering of the xticklabels of the dendrogram is given chronologically as:

 .. 6 XTICKLABELS:
[2 5 3 4 0 1]
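
The list above matches the order in which each index first appears in the merge locations. A minimal sketch of that reading (an assumption about how the list was produced, not code from the question):

```python
locations = [(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]

# dict.fromkeys keeps the first occurrence of each index, in insertion order.
xticklabels = list(dict.fromkeys(i for loc in locations for i in loc))
print(xticklabels)  # [2, 5, 3, 4, 0, 1]
```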

In relation to the linked example,

0 = BA
1 = FI 
2 = MI 
3 = NA 
4 = RM 
5 = TO

Using these parameters, I would like to generate a dendrogram that looks like the one below (borrowed from linked example):

example dendrogram

My attempt at trying to replicate this dendrogram using matplotlib is below:

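import matplotlib.pyplot as plt
import numpy as np

# levels, locations, and xticklabels are the values listed above
fig, ax = plt.subplots()
for loc, level in zip(locations, levels):
    x = np.array(loc)
    y = level * np.ones(x.size)
    ax.step(x, y, where='mid')
ax.set_xticks(xticklabels)
# ax.set_xticklabels(xticklabels)
plt.show()
plt.close(fig)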

My attempt above produces the following figure:

attempted dendrogram

I realize I have to reorder the xticklabels such that the first merged points appear at the right edge, with each subsequent merger shifting towards the left; doing so necessarily means adjusting the width of the connecting lines. Also, I used ax.step instead of ax.bar so that the lines would appear more organized (as opposed to rectangular bars everywhere); the only other thing I can think to do is to draw horizontal and vertical lines using ax.axhline and ax.axvline. I am hoping there is a simpler way to accomplish what I would like. Is there a straightforward approach using matplotlib?

1 Answer

While it would certainly be easier to rely on scipy, this is how I'd do it "manually", i.e. without it:

import matplotlib.pyplot as plt
import numpy as np

def mk_fork(x0,x1,y0,y1,new_level):
    points=[[x0,x0,x1,x1],[y0,new_level,new_level,y1]]
    connector=[(x0+x1)/2.,new_level]
    return (points),connector

levels=[138, 219, 255, 268, 295]
locations=[(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
label_map={
    0:{'label':'BA','xpos':0,'ypos':0},
    1:{'label':'FI','xpos':3,'ypos':0},
    2:{'label':'MI','xpos':4,'ypos':0},
    3:{'label':'NA','xpos':1,'ypos':0},
    4:{'label':'RM','xpos':2,'ypos':0},
    5:{'label':'TO','xpos':5,'ypos':0},
}

fig,ax=plt.subplots()

for i,(new_level,(loc0,loc1)) in enumerate(zip(levels,locations)):

    print('step {0}:\t connecting ({1},{2}) at level {3}'.format(i, loc0, loc1, new_level ))

    x0,y0=label_map[loc0]['xpos'],label_map[loc0]['ypos']
    x1,y1=label_map[loc1]['xpos'],label_map[loc1]['ypos']

    print('\t points are: {0}:({2},{3}) and {1}:({4},{5})'.format(loc0,loc1,x0,y0,x1,y1))

    p,c=mk_fork(x0,x1,y0,y1,new_level)

    ax.plot(*p)
    ax.scatter(*c)

    print('\t connector is at:{0}'.format(c))


    label_map[loc0]['xpos']=c[0]
    label_map[loc0]['ypos']=c[1]
    label_map[loc0]['label']='{0}/{1}'.format(label_map[loc0]['label'],label_map[loc1]['label'])
    print('\t updating label_map[{0}]:{1}'.format(loc0,label_map[loc0]))

    ax.text(*c,label_map[loc0]['label'])

_xticks=np.arange(0,6,1)
_xticklabels=['BA','NA','RM','FI','MI','TO']

ax.set_xticks(_xticks)
ax.set_xticklabels(_xticklabels)

ax.set_ylim(0,1.05*np.max(levels))

plt.show()

This mostly relies on creating the dictionary label_map, which maps the original "locations" (e.g. (2,5)) to the "xtick order" (e.g. (4,5)). A "fork" is created in each step i using mk_fork(), which returns both the points (which are subsequently connected in a line plot) and the connector point, which is then stored as the new values for 'xpos' and 'ypos' within the label_map.

I've added multiple print() statements to emphasize what happens at each step, and added a .text() to highlight the location of each "connector".
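
The initial 'xpos' values in label_map are hard-coded above. One way to derive them from locations instead is to concatenate the leaf lists of the two clusters at every merger, so that each cluster occupies a contiguous run of x positions. This is a minimal sketch, assuming the first index of each pair is the one that keeps representing the merged cluster (as in the loop above); the helper name `leaf_order` is hypothetical.

```python
def leaf_order(locations, n_leaves):
    """Return leaves left to right so that every merged cluster is contiguous."""
    members = {i: [i] for i in range(n_leaves)}
    for keep, absorb in locations:
        # The absorbed cluster's leaves are appended to the surviving cluster.
        members[keep] = members[keep] + members.pop(absorb)
    (order,) = members.values()  # exactly one cluster remains after all mergers
    return order

locations = [(2, 5), (3, 4), (0, 3), (0, 1), (0, 2)]
order = leaf_order(locations, 6)
xpos = {leaf: x for x, leaf in enumerate(order)}
print(order)  # [0, 3, 4, 1, 2, 5]
print(xpos)   # {0: 0, 3: 1, 4: 2, 1: 3, 2: 4, 5: 5}
```

For this data the result reproduces the 'xpos' entries used in label_map, and `order` gives the tick label order BA, NA, RM, FI, MI, TO.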

Result: a simple dendrogram

Asmus
  • If I may follow-up on what you did, I understand the use of `mk_fork`; the print steps were very helpful. But, I do not understand how you determined the values of `label_map[key]['xpos']` as they correspond to each `key = 0, 1, ..., 5`. Can you please elaborate on this part? –  May 14 '19 at 11:36
  • @allthemikeysaretaken You mean the initial values? Those are simply ordered as you "wanted them", i.e. as in the example picture. In the first entry of your `locations` array, `(2, 5)`, we know that `2` refers to `'MI'` (and `5` refers to `'TO'`, and so on), simply by looking at `label_map[2]['label']`. In the dendrogram, you wanted to plot `'MI'` at `x=4, y=0` (and `'RM'` at `x=2`, `'TO'` at `x=5`...), hence I stored these initial values. – Asmus May 14 '19 at 14:49
  • @allthemikeysaretaken if you mean: "where does this initial order come from?", then it's quite simply the "order of dissimilarity", i.e. the order of distances in the first row of your masked distance array. If you run `print(np.argsort(distance_array_name[0,:]))`, this should return `[0 3 4 1 2 5]`, i.e. the Italian towns, sorted by their distance to "Bari": `[0, 255, 412, 662, 877, 996]` – Asmus May 14 '19 at 15:04
  • Ah, I was asking the latter. I am using your example to try dynamically creating my dendrogram. Thanks for clearing that up. –  May 14 '19 at 19:44
  • I think I understand the algorithm with the exception of one part. I get that the section starting with `if loc0 in label_map:` is used to check whether the vertical line starts/ends at the last previously iterated value of `level` (rather than `0`) by checking if the city has already been branched from in a previous iteration. But I do not understand how the line `if loc0 in label_map:` works, since all of the values of `loc0 = [0, 1, ..., 5]` are keys of `label_map`. –  May 16 '19 at 03:10
  • @allthemikeysaretaken Oops, I missed that! The `if loc0 in label_map` line is from a previous iteration of my code and not required at all, since (as you said) `loc0` should of course always be one of the `label_map.keys()`. I removed it above. – Asmus May 16 '19 at 04:49
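
The `np.argsort` check mentioned in the comments can be run on its own. A minimal sketch using only the first row of the distance matrix from the question (the variable names are illustrative):

```python
import numpy as np

# Distances from BA (index 0) to every town, i.e. row 0 of the distance matrix.
first_row = np.array([0, 662, 877, 255, 412, 996])
labels = ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']

order = np.argsort(first_row)
print(order)                       # [0 3 4 1 2 5]
print([labels[i] for i in order])  # ['BA', 'NA', 'RM', 'FI', 'MI', 'TO']
```

Sorting by distance to a single point happens to give a crossing-free leaf order for this data; it may not do so for an arbitrary distance matrix.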