Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby producing a tree of nested clusters. Its tree-structured output makes it easy to visualize, which is a major advantage for analysts.

Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it usually starts with the entire data set as one cluster and keeps dividing it until each data point sits in its own cluster, or until a user-defined stopping condition is reached.

Another widely known method is AGNES (AGglomerative NESting), which works in the opposite, bottom-up direction: each data point starts as its own cluster, and the closest clusters are merged step by step.
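
As a minimal sketch of the agglomerative (AGNES-style) direction, here is how it might look with SciPy; note that SciPy implements only the bottom-up direction (there is no DIANA in SciPy), and the toy data and average linkage are illustrative assumptions:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: two loose groups in the plane.
    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

    # Bottom-up (AGNES-style) merging of the closest clusters.
    Z = linkage(X, method="average")

    # Cut the tree into two flat clusters.
    print(fcluster(Z, t=2, criterion="maxclust"))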

Distance metrics & some advantages

There are a multitude of ways to measure the distance between clusters, and this choice governs how the techniques split or merge them. Common linkage criteria include complete link and single link, which take the maximum and minimum pairwise distance between two clusters, respectively.
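
For instance, a quick sketch of how the linkage choice is expressed in SciPy (the random data is only a placeholder):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.random((10, 2))
    D = pdist(X)  # condensed vector of pairwise distances

    # Single link merges on the minimum pairwise distance between clusters,
    # complete link on the maximum; the resulting trees can differ markedly.
    Z_single = linkage(D, method="single")
    Z_complete = linkage(D, method="complete")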

Because it outputs a full hierarchical classification of a dataset, hierarchical clustering lends itself to visualization, which is a real advantage for analysts. Such trees (hierarchies) can be utilized in a myriad of ways.
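
A minimal visualization sketch, assuming SciPy and matplotlib are available (random placeholder data):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(1)
    X = rng.random((12, 3))

    Z = linkage(X, method="ward")
    dendrogram(Z)  # the tree itself is the visualization
    plt.show()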

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), which are known for their ability to discover clusters of unusual shapes (e.g., non-convex shapes).
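
A brief sketch of that contrast using scikit-learn's two-moons toy data (the eps value is an illustrative choice):

    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN, KMeans

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # k-means assumes roughly spherical clusters and splits the moons badly,
    # while density-based DBSCAN recovers the two crescents.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    db_labels = DBSCAN(eps=0.3).fit_predict(X)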

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining book, whose lecture slides and companion material can be found here.
  • Wikipedia has an entry on the topic here.
1187 questions
11 votes · 3 answers

spatial clustering in R (simple example)

I have this simple data.frame: lat<-c(1,2,3,10,11,12,20,21,22,23) lon<-c(5,6,7,30,31,32,50,51,52,53) data=data.frame(lat,lon) The idea is to find the spatial clusters based on the distance. First, I plot the map (lon,lat)…
Math · 1,274
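
The question is in R, but a minimal Python analogue of the idea (cluster points whose pairwise distance falls under a cutoff) might look like this; the threshold of 10 units is an illustrative assumption, not from the question:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    lat = [1, 2, 3, 10, 11, 12, 20, 21, 22, 23]
    lon = [5, 6, 7, 30, 31, 32, 50, 51, 52, 53]
    pts = np.column_stack([lat, lon])

    # Single-link clustering, then cut the tree at a distance threshold.
    Z = linkage(pts, method="single")
    print(fcluster(Z, t=10, criterion="distance"))  # three spatial groups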
11 votes · 1 answer

How do you visualize a ward tree from sklearn.cluster.ward_tree?

In sklearn there is one agglomerative clustering algorithm implemented, the ward method minimizing variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find examples of how to use this function. Basically my…
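
One possible approach (not necessarily the thread's answer) is to convert ward_tree's output into a SciPy linkage matrix and plot that; this assumes a scikit-learn recent enough for ward_tree to accept return_distance:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram
    from sklearn.cluster import ward_tree

    rng = np.random.default_rng(0)
    X = rng.random((20, 4))
    children, _, n_leaves, _, distances = ward_tree(X, return_distance=True)

    # SciPy's dendrogram expects rows of [left, right, distance, size].
    counts = np.zeros(children.shape[0])
    for i, (a, b) in enumerate(children):
        counts[i] = (1 if a < n_leaves else counts[a - n_leaves]) \
                  + (1 if b < n_leaves else counts[b - n_leaves])
    Z = np.column_stack([children, distances, counts]).astype(float)
    dendrogram(Z)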
10 votes · 1 answer

Understanding DynamicTreeCut algorithm for cutting a dendrogram

A dendrogram is a data structure used with hierarchical clustering algorithms that groups clusters at different "heights" of a tree - where the heights correspond to distance measures between clusters. After a dendrogram is created from some input…
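
For contrast with the dynamic approach the question asks about, the simple fixed-height cut looks like this in SciPy (random placeholder data; dynamic tree cut instead adapts the cut height per branch):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(2)
    X = rng.random((15, 3))
    Z = linkage(X, method="average")

    # Every branch that joins below height 0.5 becomes one flat cluster.
    labels = fcluster(Z, t=0.5, criterion="distance")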
10 votes · 2 answers

Extract cluster color from output of dendextend::circlize_dendrogram()

I am trying to extract the colors used in the clustering of circlize_dendrogram. Here is some sample code: library(magrittr) library(dendextend) cols <- c("#009000", "#FF033E", "#CB410B", "#3B444B", "#007FFF") dend <- iris[1:40,-5] %>% dist %>%…
Al-Ahmadgaid Asaad · 1,172
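
The question concerns R's dendextend; as a rough Python analogue, SciPy's dendrogram returns the link colors it assigned even when nothing is drawn:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(3)
    Z = linkage(rng.random((10, 2)), method="average")

    # no_plot=True still computes the layout; the returned dict carries
    # the color assigned to each link.
    info = dendrogram(Z, no_plot=True)
    print(info["color_list"])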
10 votes · 0 answers

Using precision recall metric on a hierarchy of recovered clusters

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variety of linkage methods and other tweaks to the algorithm that we want to try out. We will…
10 votes · 2 answers

How to traverse a tree from sklearn AgglomerativeClustering?

I have a numpy text file array at: https://github.com/alvations/anythingyouwant/blob/master/WN_food.matrix It's a matrix of distances between terms, and my list of terms is as follows: http://pastebin.com/2xGt7Xjh I used the following code to…
alvas · 115,346
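
A sketch of one way to walk the fitted tree (not necessarily the thread's accepted answer); children_ stores each merge, with ids below n_samples denoting original observations:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.random.rand(10, 3)
    model = AgglomerativeClustering(linkage="average").fit(X)
    n_samples = X.shape[0]

    def leaves(node):
        """Recursively collect the sample indices under a tree node."""
        if node < n_samples:  # a leaf: an original observation
            return [node]
        left, right = model.children_[node - n_samples]
        return leaves(left) + leaves(right)

    # The root is the last merge recorded in children_.
    root = n_samples + model.children_.shape[0] - 1
    print(leaves(root))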
10 votes · 0 answers

Distance metric in the Python fastcluster module

I want to do hierarchical clustering with the fastcluster module. When I use the default (Euclidean) distance metric, it works fine: import fastcluster import scipy.cluster.hierarchy distance = spatial.distance.pdist(data) linkage =…
user1680859 · 1,160
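
A sketch of how a non-default metric is usually passed: fastcluster consumes whatever condensed distance vector pdist produces, so the metric choice happens in pdist (the cosine metric here is just an illustrative pick):

    import numpy as np
    import fastcluster
    from scipy.spatial import distance
    from scipy.cluster import hierarchy

    data = np.random.rand(30, 5)
    d = distance.pdist(data, metric="cosine")    # metric chosen here
    Z = fastcluster.linkage(d, method="average")
    labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")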
10 votes · 3 answers

With SciPy, how do I get clustering for k=? when doing hierarchical clustering

So I am using fastcluster with SciPy to do agglomerative clustering. I can do dendrogram to get the dendrogram for the clustering. I can do fcluster(Z, sqrt(D.max()), 'distance') to get a pretty good clustering for my data. What if I want to…
demongolem · 9,474
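
If the goal is a fixed number of clusters rather than a distance threshold, fcluster's maxclust criterion asks for k directly; a minimal sketch with placeholder data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(4)
    Z = linkage(rng.random((25, 3)), method="ward")

    # Ask for (at most) 4 flat clusters instead of guessing a height
    # such as sqrt(D.max()).
    labels = fcluster(Z, t=4, criterion="maxclust")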
9 votes · 2 answers

Sklearn Agglomerative Clustering Custom Affinity

I'm trying to use agglomerative clustering with a custom distance metric (i.e. affinity), since I'd like to cluster a sequence of integers by sequence similarity and not something like the Euclidean distance, which isn't meaningful. My data looks…
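
One common route (an assumption about the intent, not the thread's answer) is to precompute the custom distances and hand the matrix to the model; this assumes scikit-learn 1.2+, where the parameter is named metric (older releases call it affinity):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    seqs = [[1, 2, 3, 4], [1, 2, 4, 4], [9, 9, 8, 7]]

    def seq_dist(a, b):
        """Toy mismatch count, standing in for any custom sequence distance."""
        return float(sum(x != y for x, y in zip(a, b)))

    D = np.array([[seq_dist(a, b) for b in seqs] for a in seqs])

    # Ward linkage needs raw features, so use average/complete linkage
    # with a precomputed matrix.
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                    linkage="average")
    print(model.fit_predict(D))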
9 votes · 2 answers

Find partial membership with KMeans clustering algorithm

I can calculate cluster membership with KMeans pretty easily: open System open System.IO open Utils open Accord open Accord.Math open Accord.MachineLearning let vals = [| [|1.0; 2.0; 3.0; 2.0|] [|1.1; 1.9; 3.1; 4.0|] [|2.0; 3.0; 4.0;…
Steven · 3,238
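
The question is in F# with Accord; as one Python analogue, a mixture model yields the fractional memberships that hard k-means cannot (GaussianMixture here stands in for any soft-assignment method, e.g. fuzzy c-means):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.array([[1.0, 2.0, 3.0, 2.0],
                  [1.1, 1.9, 3.1, 4.0],
                  [2.0, 3.0, 4.0, 4.0],
                  [8.0, 9.0, 9.5, 9.0]])

    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gm.predict_proba(X))  # each row sums to 1: partial membership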
9 votes · 2 answers

Duelling dendrograms in R (placing dendrograms back to back in R)

Is there any fairly straightforward way of placing two dendrograms 'back to back' in R? The two dendrograms contain the same objects but are clustered in slightly different ways. I need to emphasise how the dendrograms differ. So something like what…
Elizabeth · 6,391
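
In Python one can approximate the effect by mirroring one of two SciPy dendrograms (a sketch of the idea only; aligning the leaf orderings is a separate problem):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(5)
    X = rng.random((10, 4))

    # Two mirrored panels so the leaves of the trees face each other.
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
    dendrogram(linkage(X, method="single"), orientation="right", ax=ax1)
    dendrogram(linkage(X, method="complete"), orientation="left", ax=ax2)
    plt.show()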
8 votes · 1 answer

HDBSCAN difference between parameters

I'm confused about the difference between the following parameters in HDBSCAN: min_cluster_size, min_samples, cluster_selection_epsilon. Correct me if I'm wrong. For min_samples, if it is set to 7, then clusters formed need to have 7 or more…
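
A sketch of where those knobs sit, using the standalone hdbscan package (the parameter values are illustrative, not recommendations):

    import numpy as np
    import hdbscan

    X = np.random.rand(200, 2)
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=10,             # smallest grouping kept as a cluster
        min_samples=7,                   # conservativeness of the density estimate
        cluster_selection_epsilon=0.05,  # merge clusters closer than this
    )
    labels = clusterer.fit_predict(X)    # -1 marks noise points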
8 votes · 1 answer

Python - Calculate Hierarchical clustering of word2vec vectors and plot the results as a dendrogram

I've generated a 100D word2vec model using my domain text corpus, merging common phrases, for example (good bye => good_bye). Then I've extracted 1000 vectors of desired words. So I have a 1000 numpy.array like so: …
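
A minimal sketch of the usual pipeline, with random stand-ins for the 1000 extracted vectors (cosine distance is a common choice for word2vec, but an assumption here):

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram

    words = [f"word_{i}" for i in range(50)]   # placeholder labels
    vecs = np.random.rand(50, 100)             # placeholder 100-D vectors

    Z = linkage(pdist(vecs, metric="cosine"), method="average")
    dendrogram(Z, labels=words, leaf_font_size=6)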
8 votes · 3 answers

With SciPy dendrogram, can I change the linewidth?

I'm making a big dendrogram using SciPy, and in the resulting dendrogram the line thickness makes it hard to see detail. I want to decrease the line thickness to make it easier to see and more MATLAB-like. Any suggestions? I'm doing: import…
ja.kb.ca · 83
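
One approach that works because SciPy draws dendrogram branches as ordinary matplotlib lines: lower the global line-width rcParam before plotting (a sketch with placeholder data):

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    matplotlib.rcParams["lines.linewidth"] = 0.5  # thinner branches

    Z = linkage(np.random.rand(30, 4), method="ward")
    dendrogram(Z)
    plt.show()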
8 votes · 2 answers

Where can I find a good set of benchmark clustering datasets with ground truth labels?

I am looking for a clustering dataset with "ground truth" labels for some known natural clustering, preferably with high dimensionality. I found some good candidates here (http://cs.joensuu.fi/sipu/datasets/), but only the Glass and Iris data-sets…