Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby generating a tree of clusters. Hierarchical clustering provides advantages to analysts with its visualization potential.

Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it usually starts from the entire data set and divides it repeatedly until either each data point resides in its own cluster or a user-defined stopping condition is met.

Another widely known method is AGNES (AGglomerative NESting), which performs the opposite of DIANA: it starts with each data point in its own cluster and merges clusters bottom-up.
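The bottom-up procedure can be sketched in a few lines of plain Python. This is only a naive illustration, not the AGNES implementation of any particular package: the function name `agglomerate`, the choice of single-link distance for merging, and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters=1):
    """Naive bottom-up clustering using single-link distance (illustrative only).

    Returns the final clusters (lists of point indices) and the merge history.
    """
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # Merge cluster b into cluster a; b > a, so popping b keeps index a valid.
        clusters[a] = clusters[a] + clusters[b]
        clusters.pop(b)
    return clusters, merges


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerate(points, n_clusters=2)
```

The merge history is what a dendrogram visualizes. A divisive method such as DIANA would build a similar tree in the other direction, starting from one cluster that contains everything.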

Distance metrics and some advantages

There are a multitude of ways to compute the distance between clusters, which the clustering technique uses to decide where to divide a cluster or which clusters to merge (for example, complete and single linkage, which take the maximum and minimum pairwise distance between the members of two clusters, respectively).
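As a small illustration of the two linkage criteria named above, the sketch below computes both for a pair of clusters. The helper names `single_link` and `complete_link` and the sample coordinates are assumptions for the example, and Euclidean distance is assumed as the underlying point-to-point metric.

```python
from math import dist


def single_link(a, b):
    """Minimum pairwise distance between members of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Maximum pairwise distance between members of clusters a and b."""
    return max(dist(p, q) for p in a for q in b)


# Two hypothetical clusters lying on the x-axis.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (6.0, 0.0)]

print(single_link(a, b))    # closest pair: (1, 0) and (3, 0)
print(complete_link(a, b))  # farthest pair: (0, 0) and (6, 0)
```

Single linkage tends to produce elongated, chained clusters, while complete linkage favors compact ones, so the choice of criterion can change the resulting tree considerably.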

Hierarchical clustering provides advantages to analysts with its visualization potential, given that its output is a hierarchical classification of the dataset. Such trees (hierarchies) can be used in many ways, for example by cutting them at different heights to obtain flat clusterings of different granularity.
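One way to use such a hierarchy is to cut it at a chosen height to get a flat clustering. The sketch below relies on the fact that, for single linkage specifically, cutting the tree at a given height yields the connected components of the graph that joins all points at most that far apart. The function name `cut_single_linkage` and the sample data are assumptions for the example, and the shortcut does not carry over to other linkage criteria.

```python
from itertools import combinations
from math import dist


def cut_single_linkage(points, height):
    """Flat clusters from cutting a single-linkage tree at `height`.

    Implemented as connected components of the graph that links every
    pair of points whose Euclidean distance is <= height (union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= height:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
fine = cut_single_linkage(points, height=1.5)    # low cut: many small clusters
coarse = cut_single_linkage(points, height=6.5)  # high cut: fewer, larger clusters
```

Raising the cut height can only merge clusters, never split them, which is what makes the result a hierarchy.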

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), which are known for their ability to discover clusters of unusual (for example, non-circular) shapes.

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining book, whose lecture slides and companion material can be found here.
  • Wikipedia has an entry on the topic here.
1187 questions
18
votes
2 answers

Implementing an efficient graph data structure for maintaining cluster distances in the Rank-Order Clustering algorithm

I'm trying to implement the Rank-Order Clustering algorithm (which is a kind of agglomerative clustering) from scratch; here is a link to the paper. I have read through the paper (many times) and I have an implementation that is working although it…
YellowPillow
  • 4,100
  • 6
  • 31
  • 57
16
votes
2 answers

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code: Y =…
Adrian Rosebrock
  • 907
  • 2
  • 9
  • 18
15
votes
5 answers

Parallel construction of a distance matrix

I work on hierarchical agglomerative clustering on large amounts of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation for this task is the following (here in…
14
votes
1 answer

Using igraph in python for community detection and writing community number for each node to CSV

I have a network that I would like to analyze using the edge_betweenness community detection algorithm in igraph. I'm familiar with NetworkX, but am trying to learn igraph because of its additional community detection methods over NetworkX. My…
CurtLH
  • 2,329
  • 4
  • 41
  • 64
14
votes
1 answer

Matching dendrogram with cluster number in Python's scipy.cluster.hierarchy

The following code generates a simple hierarchical cluster dendrogram with 10 leaf nodes: import scipy import scipy.cluster.hierarchy as sch import matplotlib.pylab as plt X = scipy.randn(10,2) d = sch.distance.pdist(X) Z=…
user1910316
  • 499
  • 7
  • 17
14
votes
2 answers

bootstrapping hierarchical/multilevel data (resampling clusters)

I am producing a script for creating bootstrap samples from the cats dataset (from the -MASS- package). Following the Davison and Hinkley textbook [1] I ran a simple linear regression and adopted a fundamental non-parametric procedure for…
Stefano Lombardi
  • 1,581
  • 2
  • 22
  • 48
13
votes
1 answer

Scikit-learn Agglomerative Clustering Connectivity Matrix

I am attempting to perform constrained clustering using sklearn's agglomerative clustering command. To make the algorithm constrained, it requests a "connectivity matrix". This is described as: The connectivity constraints are imposed via an…
Michael Davidson
  • 1,391
  • 1
  • 14
  • 31
13
votes
1 answer

Pruning dendrogram in scipy (hierarchical clustering)

I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet: Y = fastcluster.linkage(D, method='centroid') # D-distance matrix Z1 =…
user1354607
  • 131
  • 1
  • 4
12
votes
1 answer

agglomerative clustering in sklearn

I have some data and also the pairwise distance matrix of these data points. I want to cluster them using agglomerative clustering. I read that in sklearn, we can have 'precomputed' as affinity and I expect it is the distance matrix. But I could not…
B bonita
  • 171
  • 2
  • 5
12
votes
2 answers

Which programming structure for clustering algorithm

I am trying to implement the following (divisive) clustering algorithm (below is presented short form of the algorithm, the full description is available here): Start with a sample x, i = 1, ..., n regarded as a single cluster of n data points and a…
Andrej
  • 3,719
  • 11
  • 44
  • 73
12
votes
2 answers

Is there any sparse support for dist function in R?

Has anyone heard about any package or functionality that works the same as the dist{stats} function from R, which creates the distance matrix that is computed by using the specified distance measure to compute the distances between the rows of a…
Marcin
  • 7,834
  • 8
  • 52
  • 99
11
votes
1 answer

Dendrogram y-axis labeling confusion

I have a large (106x106) correlation matrix in pandas with the following…
jason m
  • 6,519
  • 20
  • 69
  • 122
11
votes
4 answers

Time series distance metric

In order to cluster a set of time series I'm looking for a smart distance metric. I've tried some well-known metrics but none fits my case. ex: Let's assume that my cluster algorithm extracts these three centroids [s1, s2, s3]: I want to put…
paolof89
  • 1,319
  • 5
  • 17
  • 31
11
votes
1 answer

DIvisive ANAlysis (DIANA) Hierarchical Clustering

(This post is continuation of my previous question on divisive hierarchical clustering algorithm.) The problem is how to implement this algorithm in Python (or any other language). Algorithm description A divisive clustering proceeds by a series of…
Andrej
  • 3,719
  • 11
  • 44
  • 73
11
votes
1 answer

Cutting SciPy hierarchical dendrogram into clusters via a threshold value

I'm trying to use SciPy's dendrogram method to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are…
Bryan
  • 5,999
  • 9
  • 29
  • 50