Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby generating a tree of clusters. Hierarchical clustering provides advantages to analysts with its visualization potential.

Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it usually starts from the entire data set and keeps dividing it until each data point resides in its own cluster or a user-defined stopping condition is reached.

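Another widely known method is AGNES (AGglomerative NESting), which does the opposite of DIANA: it works bottom-up, starting with each data point in its own cluster and repeatedly merging the closest clusters.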
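
As a rough illustration of the bottom-up idea, here is a minimal sketch in plain Python. It is not the AGNES implementation from any particular library: the function name `agglomerative`, the use of 1-D integer points, and the choice of the smallest pairwise distance as the cluster distance are all assumptions made for the example.

```python
from itertools import combinations


def agglomerative(points, num_clusters):
    """Naive bottom-up clustering of 1-D points (illustrative sketch only)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # history of merges, i.e. the tree read from the leaves up

    def closest_pair_distance(a, b):
        # Assumed cluster distance: smallest distance between any two members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Find the indices (i < j) of the two clusters that are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: closest_pair_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerative([1, 2, 6, 7, 15], num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Running the loop until a single cluster remains would record the full tree in `merges`; stopping early at `num_clusters` plays the role of the user-defined condition. This naive version recomputes all pairwise distances on every pass, so it is only suitable for small inputs.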

Distance metric & some advantages

There is a multitude of ways to compute the distance between clusters, which determines how the clustering techniques divide clusters or merge them into new ones. Two common choices are complete-link and single-link distance, which take the maximum and the minimum, respectively, of the pairwise distances between the points of two clusters.
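
To make the difference between the two link distances concrete, here is a small sketch in plain Python. The helper names `single_link` and `complete_link` and the sample points are assumptions for illustration, and Euclidean distance is assumed as the underlying point-to-point metric.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link(cluster_a, cluster_b):
    # Minimum over all pairwise distances between the two clusters.
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def complete_link(cluster_a, cluster_b):
    # Maximum over all pairwise distances between the two clusters.
    return max(dist(p, q) for p in cluster_a for q in cluster_b)


# Two hypothetical clusters of 2-D points lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_link(cluster_a, cluster_b))    # 3.0, from (1, 0) to (4, 0)
print(complete_link(cluster_a, cluster_b))  # 6.0, from (0, 0) to (6, 0)
```

Single link tends to chain nearby clusters together, while complete link favors compact clusters, so the same data can yield quite different trees depending on which one is used.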

Because its output is a hierarchical classification of the dataset, hierarchical clustering lends itself well to visualization, which is an advantage for analysts. Such trees (hierarchies) can be used in a myriad of ways.

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), which are known for discovering clusters of unusual shapes (such as non-circular ones).

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining book, whose lecture slides and companion material can be found here.
  • Wikipedia has an entry on the topic here.
1187 questions
4
votes
1 answer

Triangle vs. Square distance matrix for Hierarchical Clustering Python?

I have been experimenting with Hierarchical Clustering and in R it's so simple hclust(as.dist(X),method="average") . I found a method in Python that is pretty simple as well, except I'm a little confused on what's going on with my input distance…
O.rka
  • 29,847
  • 68
  • 194
  • 309
4
votes
2 answers

How to change node labels of dendrogram plot

I did a hierarchical cluster for a project. I have 300 observations each of 20 variables. I indexed all the variables so that each variable is between 0 and 1, a larger value being better. I used the following code to create a cluster plot. d_data…
4
votes
1 answer

Large distance matrix in clustering

I am running R 3.2.3 on a machine with 16 GB RAM. I have a large matrix of 3,00,000 rows x 12 columns. I wanna use a hierarchical clustering algorithm in R, so before I do that, I am trying to create a distance matrix. Since data is of mixed type, I…
4
votes
1 answer

R large distance matrix in vegan

I am running R 3.2.3 on a machine with 128 GB of RAM. I have a large matrix of 123028 rows x 168 columns. I would like to use a hierarchical clustering algorithm in R, so before I do that, I am trying to create a distance matrix in R using the…
jk22
  • 95
  • 1
  • 1
  • 8
4
votes
0 answers

Memory Efficient Agglomerative Clustering with Linkage in Python

I want to cluster 2d points (latitude/longitude) on a map. The number of points is 400K so the input matrix would be 400k x 2. When I run scikit-learn's Agglomerative Clustering I run out of memory and my memory is about 500GB. class…
4
votes
1 answer

Python hierarchical clustering with missing values

I am new to Python. I would like to perform hierarchical clustering on N by P dataset that contains some missing values. I am planning to use scipy.cluster.hierarchy.linkage function that takes distance matrix in condensed form. Does Python have a…
4
votes
2 answers

Annotating Dendrogram nodes in Scipy/Matplotlib

I'm trying to label the nodes in a dendrogram produced by scipy.cluster.hierarchy.dendrogram. I'm working with the augmented dendrogram suggested here, trying to replace the inter-cluster distance labels (1.01,1.57) in the example by strings such…
user666
  • 5,231
  • 2
  • 26
  • 35
4
votes
1 answer

How to draw the plot of within-cluster sum-of-squares for a cluster?

I have a cluster plot made in R, and I want to optimize the "elbow criterion" of clustering with a wss plot, but I do not know how to draw a wss plot for a given cluster. Would anyone help me? Here is my…
Ping Tang
  • 415
  • 1
  • 9
  • 20
4
votes
1 answer

Use a similarity function for clustering scikit-learn

I use a function to calculate similarity between a pair of documents and want to perform clustering using this similarity measure. Code so far: Sim=np.zeros((n, n)) # create a numpy array i=0 j=0 for i in range(0,n): for j in…
AMisra
  • 1,869
  • 2
  • 25
  • 45
4
votes
1 answer

Hierarchical clustering from confusion matrix with python

Based on the following answer, I tried to code hierarchical class clustering based on a confusion matrix. A confusion matrix is used to evaluate the results of a classification problem and isn't symmetric. Each row represents the instances in an actual class.…
Eric
  • 2,301
  • 3
  • 23
  • 30
4
votes
1 answer

How to hierarchically cluster a data matrix in R?

I am trying to cluster a data matrix produced from scientific data. I know how I want the clustering done, but am not sure how to accomplish this feat in R. Here is what the data looks like: A1 A2 A3 B1 B2 B3 C1 …
jake9115
  • 3,964
  • 12
  • 49
  • 78
4
votes
1 answer

How do you print the rows of a hclust object in R?

I am using R to cluster a matrix which I have named 'tissuedata'. I have a hclust object which was generated using the following code: TissueDist<-dist(tissuedata, method="euclidean") TissueClust<-hclust(TissueDist, method='complete') Now I…
user2639056
  • 295
  • 1
  • 5
  • 10
4
votes
2 answers

clustering data without input parameters

This is more of a theoretical question: do you know any clustering algorithm (flat or hierarchical) which does not require any input parameters, like the number of clusters or size of the neighborhood etc? in other words, you simply feed your data…
4
votes
1 answer

Are there any good hierarchical clustering packages in python which take distance matrix?

I have a distance matrix composed of pair-wise levenshtein's distance. I was using scikits-learn. But hierarchical clustering algorithm doesn't take distance matrix as input for clustering. SO I have to search for a new package which can do this.…
darshan
  • 1,230
  • 1
  • 11
  • 17
4
votes
2 answers

K-centers clustering using R

I can't find a simple library function for k-centers clustering using R, whereas I could for k-means (kmeans()) and hierarchical clustering (hclust()). Is there a library function for simple greedy k-centers clustering using R as depicted in this …
torger
  • 2,308
  • 4
  • 28
  • 35