Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby generating a tree of clusters. Hierarchical clustering provides advantages to analysts with its visualization potential.

Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it usually starts from the entire data set and keeps dividing it until each data point resides in its own cluster or a user-defined stopping condition is reached.

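Another widely known method is AGNES (AGglomerative NESting), which performs the opposite of DIANA: it works bottom-up, starting with each data point in its own cluster and repeatedly merging the closest clusters.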
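The bottom-up merging that AGNES performs can be sketched in a few lines of plain Python. This is a naive illustration rather than a reference implementation: the function names (`agnes`, `single_link`), the 1-D sample points, and the choice of single linkage as the default are all assumptions made for the example.

```python
from itertools import combinations


def single_link(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)


def agnes(points, linkage=single_link):
    """Naive bottom-up clustering; returns the merges in the order they happen."""
    clusters = [(p,) for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters that is closest under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        dist = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agnes([1.0, 2.0, 6.0, 7.0, 15.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

Each recorded merge corresponds to one internal node of the resulting tree, and the merge distance is the height at which that node would be drawn in a dendrogram. The loop re-scans all cluster pairs on every iteration, so it is only suitable for small inputs.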

Distance metric & some advantages

There are a multitude of ways to compute the distance between clusters, which is what the clustering techniques use to decide which clusters to divide or merge. Two common choices are complete-link and single-link distance, which take the maximum and the minimum pairwise distance between the points of two clusters, respectively.
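As a small worked example of these cluster-to-cluster distances, the sketch below computes single-link and complete-link distance for two toy clusters. Average linkage is included only for comparison and is an addition to the text above; the function names and the sample points are hypothetical, and Euclidean distance between points is assumed.

```python
from itertools import product


def pairwise(a, b):
    """All point-to-point Euclidean distances between clusters a and b."""
    return [sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5 for p, q in product(a, b)]


def single_link(a, b):
    return min(pairwise(a, b))  # distance of the closest pair of points


def complete_link(a, b):
    return max(pairwise(a, b))  # distance of the farthest pair of points


def average_link(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)  # mean over all cross-cluster pairs


left = [(0.0, 0.0), (0.0, 3.0)]
right = [(4.0, 0.0), (8.0, 0.0)]

print("single:  ", single_link(left, right))    # 4.0
print("complete:", complete_link(left, right))  # sqrt(73), about 8.54
print("average: ", average_link(left, right))
```

Single linkage tends to chain clusters together through their nearest points, while complete linkage favours compact clusters, so the choice can change the resulting tree noticeably.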

Hierarchical clustering provides advantages to analysts with its visualization potential, given its output of the hierarchical classification of a dataset. Such trees (hierarchies) could be utilized in a myriad of ways.
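One common use of such a tree is cutting it at a chosen height to obtain flat clusters. The sketch below assumes a hypothetical tree representation made up for this example: a leaf is a string label, and an internal node is a tuple `(height, left, right)`, where `height` is the distance at which the two subtrees were merged.

```python
def leaves(node):
    """Collect the leaf labels under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut(node, height):
    """Flat clusters obtained by cutting the tree at the given height."""
    if isinstance(node, str) or node[0] <= height:
        return [leaves(node)]  # whole subtree was merged at or below the cut
    _, left, right = node
    return cut(left, height) + cut(right, height)


# Toy hierarchy: (a, b) and (c, d) merge at 1.0, those two groups merge at 4.0,
# and e joins last at 8.0.
tree = (8.0, (4.0, (1.0, "a", "b"), (1.0, "c", "d")), "e")

print(cut(tree, 2.0))  # [['a', 'b'], ['c', 'd'], ['e']]
print(cut(tree, 5.0))  # [['a', 'b', 'c', 'd'], ['e']]
```

Lowering the cut height yields more, smaller clusters; raising it yields fewer, larger ones, which is why a single tree can serve several levels of granularity.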

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), which are known for their ability to discover clusters of unusual, e.g. non-circular, shapes.

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining book, whose lecture slides and companion material can be found here.
  • Wikipedia has an entry on the topic here.
1187 questions
-1
votes
1 answer

How to read multiple text files in Spark for document clustering?

I want to read multiple text documents from a directory for document clustering. For that, I want to read data as: SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]").set("spark.executor.memory", "2g"); JavaSparkContext…
-1
votes
3 answers

Automatically determining the number of clusters in hierarchical clustering

I have a question related to hierarchical clustering. My data set contains 10,000 objects. When I run hierarchical clustering with average linkage, I end up with 30 clusters. The issue is that I don't…
A.Dorra
-1
votes
1 answer

How to get the top N frequent words in each cluster? Sklearn

I have a text corpus that contains 1000+ articles, each on a separate line. I used hierarchical clustering with Sklearn in Python to produce clusters of related articles. This is the code I used to do the clustering. Note: X is a sparse NumPy 2D array…
-1
votes
1 answer

Plotting hierarchical clustering dendrograms for large data sets

I have a huge data set of time series data. In order to visualise the clustering in Python, I want to plot time series graphs along with the dendrogram, as shown below. I tried to do this with the subgrid2plot() function in Python by creating two…
Shivam Mitra
-1
votes
1 answer

How can I cluster buckets of strings?

I have several buckets. Each bucket contains many tags (strings). How can I cluster buckets together based on similarity or overlap? E.g. Bucket A: 'ostrich', 'sparrow', 'hummingbird', 'zebra', 'blue jay' Bucket B: 'banana', 'watermelon', 'grape',…
benwiz
-1
votes
1 answer

R dendrogram by cluster

I am using R to plot a dendrogram of a hierarchical clustering. I have performed a hierarchical clustering of ~3000 elements. The plot of the corresponding tree is obviously super messy. These 3000 elements are clustered in 20 groups using the cutree…
Oselm
-1
votes
1 answer

Extrapolation of sample to population

How can I extrapolate a sample of 10,000 rows to the entire population (100,000) in Python? I did agglomerative clustering on the sample in Python, but I am stuck on extrapolating the result to the entire population.
-1
votes
1 answer

Hierarchical Clustering using Python on Correlation Coefficient

I have data in a 50-by-50 matrix that represents 50 journals and their correlations. Now I am trying to plot a graph showing which clusters those 50 journals fall into based on the data. 1) I prefer to use complete-linkage or Ward's method…
-1
votes
1 answer

How to calculate the distance between any two elements among more than 10^8 records in order to cluster them using Spark?

I have more than 10^8 records stored in Elasticsearch. Now I want to cluster them by writing a hierarchical algorithm or by using PIC from Spark MLlib. However, I can't use an efficient algorithm like K-means because every record is stored in…
-1
votes
1 answer

Optimal Clusters Formula: Finding Equivalent Using NbClust

I have two variables that I calculated from Matrix B: 1) The Correlation Matrix cor(B) 2) The Hierarchical Cluster of the Dissimilarity Matrix from the Correlation Matrix I then used the clustConfigurations function to calculate the "elbow graph"…
nak5120
-1
votes
1 answer

Compare the clustering algorithms in R

I have implemented 3 clustering algorithms in R (PAM, k-means and hierarchical). I want to find which parameters produce the best results for each algorithm. I have no idea how to do it in R. Does anyone know how to do it? Thank you for your help.
-1
votes
1 answer

Dendrogram in C#

I implemented a version of AGNES (agglomerative clustering algorithm) in C#, but I am struggling to implement a dendrogram. I implemented a binary tree using a treeview component however I will need to build a "real" dendrogram for analysis of…
-1
votes
1 answer

Clustering results print with details in python

I am trying to print my clustering results in Python (a 2D NumPy array), but I can't find any way to print the results. I drew my dendrogram, but I need results such as: Cluster 1: Cluster 2: Cluster 3: My code is: from matplotlib import…
-1
votes
2 answers

Clustering based on co-occurrences

I would like to cluster data based on keyword co-occurrences using R. I have encountered 2 difficulties compared to other posts. The words are of different hierarchy levels. The keywords do not necessarily show in the order or the hierarchy…
-1
votes
1 answer

R: Hierarchical clustering

Let's say we have the following dataset set.seed(144) dat <- matrix(rnorm(100), ncol=5) The following function creates all possible combinations of columns and removes the first (combinations <- do.call(expand.grid, rep(list(c(F, T)),…