Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby producing a tree of clusters. Because its output is a tree, hierarchical clustering lends itself particularly well to visualization.


Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it usually starts with the entire data set in a single cluster and repeatedly splits clusters until each data point sits in its own cluster, or until a user-defined stopping condition is reached.

Another widely known method is AGNES (AGlomerative NESting), which works in the opposite direction: each data point starts in its own cluster, and the closest pairs of clusters are merged repeatedly until only one cluster remains.
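As a minimal sketch of the agglomerative (AGNES-style) approach, using SciPy (assumed available; `fastcluster` and other libraries expose the same linkage-matrix convention):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Agglomerative (bottom-up) clustering: start with singletons and merge.
Z = linkage(pdist(X), method="average")

# Cut the tree so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # two clusters: points 0,1 together and points 2,3 together
```

The linkage matrix `Z` records every merge (the two clusters joined, the merge distance, and the resulting cluster size), which is what the dendrogram and flat-cluster utilities consume.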

Distance metrics and some advantages

There are a multitude of ways to define the distance between clusters that these techniques use when splitting or merging them into new clusters: complete-link and single-link distances, for example, take the maximum and minimum pairwise distance between the two clusters, respectively.
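The effect of the linkage criterion can be seen directly in the merge heights of the resulting trees. A small sketch with SciPy (assumed available), comparing single and complete linkage on the same distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
D = pdist(X)  # condensed pairwise Euclidean distances

# Single link merges at the MINIMUM inter-cluster pairwise distance,
# complete link at the MAXIMUM, so the complete-link tree's final
# merge height can never be smaller than the single-link one.
Z_single = linkage(D, method="single")
Z_complete = linkage(D, method="complete")

print(Z_single[-1, 2], Z_complete[-1, 2])  # final merge heights
```

Which criterion is "right" depends on the data: single link finds elongated, chained clusters, while complete link favors compact ones.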

Hierarchical clustering is attractive to analysts because of its visualization potential: its output is a hierarchical classification of the dataset. Such trees (hierarchies) can be utilized in a myriad of ways.
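The tree structure can also be inspected programmatically, without plotting. A sketch using SciPy's `dendrogram` with `no_plot=True`, which returns the layout dictionary (leaf order, coordinates) rather than drawing it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [10.0, 10.0]])
Z = linkage(pdist(X), method="average")

# no_plot=True returns the layout dict without requiring matplotlib.
dend = dendrogram(Z, no_plot=True)
print(dend["ivl"])  # leaf labels in left-to-right display order
```

The same dictionary also carries `icoord`/`dcoord`, the x/y coordinates of every U-shaped link in the plot.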

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), the latter known for discovering clusters of unusual (e.g. non-convex) shapes.
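For contrast with the hierarchical methods above, here is a minimal partitioning sketch using SciPy's `kmeans2` (initial centroids supplied explicitly via `minit="matrix"` so the run is deterministic; this is an illustrative toy, not a recommendation over scikit-learn's `KMeans`):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Deterministic run: supply the two initial centroids explicitly.
init = np.array([[0.0, 0.5], [10.0, 10.5]])
centroids, labels = kmeans2(X, init, minit="matrix")
print(labels)
```

Unlike hierarchical clustering, a partitioning method returns one flat assignment for a fixed k, with no tree to cut at other levels.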

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining: Concepts and Techniques, whose lecture slides and companion material are available on the book's website.
  • Wikipedia's entry on hierarchical clustering.
1187 questions
5 votes · 2 answers

Clustering time series data in Python

I am trying to cluster time series data in Python using different clustering techniques. K-means didn't give good results. The following images are what I have after clustering using agglomerative clustering. I also tried Dynamic Time warping. These…
5 votes · 1 answer

Hierarchical clustering on sparse observation matrix

I'm trying to perform hierarchical clustering on large sparse observation matrix. The matrix represents movie ratings for a number of users. My goal is to cluster similar users based on their movie preferences. However, I need a dendrogram, rather…
— Siegmeyer (4,312)
5 votes · 0 answers

How to compute cophenetic correlation from the linkage matrix output by fastcluster's memory saving hierarchical clustering method?

I'm using the fastcluster package for Python to compute the linkage matrix for a hierarchical clustering procedure over a large set of observations. So far so good, fastcluster's linkage_vector() method brings the capability of clustering a much…
— PDRX (1,003)
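For the cophenetic-correlation question above: SciPy's `cophenet` accepts any valid linkage matrix, including one produced by another library, paired with the condensed distance vector of the original observations. A sketch (using SciPy's own `linkage` as a stand-in for fastcluster's `linkage_vector` output, and assuming the observations still fit in memory for `pdist`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
D = pdist(X)  # condensed distances over the observations

Z = linkage(D, method="average")  # stand-in for fastcluster's linkage matrix

# cophenet compares the tree's merge heights against the original distances.
c, coph_dists = cophenet(Z, D)
print(c)  # cophenetic correlation coefficient
```

The catch with memory-saving methods is precisely that `pdist(X)` may not fit in memory; `cophenet` itself only needs the linkage matrix and that distance vector.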
5 votes · 2 answers

Relation between dendrogram plot coordinates and ClusterNodes in scipy

I'm looking for a way to get the coordinates of a cluster point in the dendrogram plot based on its ClusterNode return by to_tree. Using scipy to build a dendogram from data such as: X = data Y = pdist(X) Z = linkage(Y) dend =…
— sereizam (2,048)
5 votes · 1 answer

Specify max distance in agglomerative clustering (scikit learn)

When using a clustering algorithm, you always have to specify a shutoff parameter. I am currently using Agglomerative clustering with scikit learn, and the only shutoff parameter that I can see is the number of clusters. agg_clust =…
— Arcyno (4,153)
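On cutting by distance rather than by cluster count: with SciPy (an alternative route to the scikit-learn question above), `fcluster` with `criterion="distance"` stops merging wherever the linkage distance would exceed a threshold. A sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z = linkage(pdist(X), method="complete")

# Cut wherever the merge distance would exceed the threshold,
# instead of fixing the number of clusters in advance.
labels = fcluster(Z, t=5.0, criterion="distance")
print(labels)
```

The number of clusters then falls out of the threshold rather than being specified up front.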
5 votes · 0 answers

Hierarchical clustering parallel processing in R

Is there straightforward method to take advantage of parallel processing in R within a HPC cluster to make my computations faster for Hierarchical clustering algorithm? Because right now, the average utilization of processors is just 1 though i can…
— Kraamed (51)
5 votes · 1 answer

Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug...)

I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works... I thought I knew how it worked and I was able to use the output to reconstruct the dendrogram but it seems as if I am not understanding it anymore or there is a…
— O.rka (29,847)
5 votes · 1 answer

Extract the hierarchical structure of the nodes in a dendrogram or cluster

I would like to extract the hierarchical structure of the nodes of a dendrogram or cluster. For example in the next example: library(dendextend) dend15 <- c(1:5) %>% dist %>% hclust(method = "average") %>% as.dendrogram dend15 %>% plot The nodes…
— Ruben (493)
5 votes · 2 answers

nbclust doesn't work without data matrix

I was trying to use the nbclust function and got the error: "Error in t(jeu) %*% jeu : requires numeric/complex matrix/vector arguments" this is how I run the function: NbClust(input_data, diss = dissimilarity_matrix, …
5 votes · 1 answer

How to create a distance matrix for clustering using correlation instead of euclidean distance in R?

Goal I want to do hierarchical clustering of samples (rows) in my data set. What I know: I have seen examples where distance matrices are created using euclidean distance, etc by employing dist() function in R. I have also seen correlation being…
— umair durrani (5,597)
5 votes · 1 answer

How to calculate Silhouette Score of the scipy's fcluster using scikit-learn silhouette score?

I am using scipy.cluster.hierarchy.linkage as a clustering algorithm and pass the result linkage matrix to scipy.cluster.hierarchy.fcluster, to get the flattened clusters, for various thresholds. I would like to calculate the Silhouette score of…
— J.J (51)
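For the silhouette question above, the two libraries combine directly: `fcluster`'s flat labels can be passed to scikit-learn's `silhouette_score` (assuming scikit-learn is installed). A sketch scanning a few thresholds:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(5, 0.5, (15, 2))])

Z = linkage(pdist(X), method="average")

# Score the flat clustering obtained at each candidate threshold.
best = None
for t in [0.5, 1.0, 2.0, 4.0]:
    labels = fcluster(Z, t=t, criterion="distance")
    k = len(set(labels))
    if k < 2 or k >= len(X):   # silhouette needs 2 <= k <= n-1
        continue
    s = silhouette_score(X, labels)
    if best is None or s > best[1]:
        best = (t, s)
print(best)  # (threshold, silhouette) of the best cut found
```

The guard matters: `silhouette_score` raises if every point is its own cluster or everything collapses into one.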
5 votes · 1 answer

Creating and graphing Hierarchical Trees in Python with pandas

So I have hierarchical information stored within a pandas DataFrame and I would like to construct and visualize a hierarchical tree based on this information. For example, a row in my DataFrame has the column headings…
— Wes Field (3,291)
5 votes · 2 answers

Why does scipy.cluster.hierarchy.linkage need a metric?

We're required to pass a distance matrix, so there should be no need to calculate any additional distances, right? What am I missing? Documentation here:…
— elplatt (3,227)
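The short answer to the question above: `linkage` accepts either a condensed distance matrix or a raw observation matrix, and `metric` is used only in the second case, where it computes the distances itself. A sketch showing the two calls agree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))

# A condensed distance matrix: the metric argument is irrelevant here.
Z_from_distances = linkage(pdist(X), method="average")

# Raw observations: linkage computes the distances internally, and this
# is the only case where metric matters.
Z_from_observations = linkage(X, method="average", metric="euclidean")

print(np.allclose(Z_from_distances, Z_from_observations))  # True
```

So if you have already computed distances, pass them in condensed form and ignore `metric`.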
5 votes · 1 answer

interpreting the results of OPTICSxi Clustering

I am interested in detecting clusters in areas with varying-density, such as user-generated data in cities, and for that I adopted the OPTICS algorithm. Unlike DBSCAN, the OPTICS algorithm does not produce a strict cluster partition, but an…
5 votes · 3 answers

Efficiently find minimum of large array using Opencl

I am working on the implementation of a hierarchical clustering algorithm in opencl. For each step, I have find the minimum value in a very large array (approx. 10^8 entries) so that I know which elements have to be combined into a new cluster. The…
— mTORjaeger (67)