Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and it is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question that does not directly concern implementation, consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one. See "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
13
votes
3 answers

News clustering

How do Google News and Techmeme cluster news items that are similar? Is there any well-known algorithm used to achieve this? Appreciate your help. Thanks in advance.
niraj
13
votes
3 answers

Where to find a reliable K-medoid (not k-means) open source software/tool?

I am learning the K-medoids algorithm, so I am sorry if I ask inappropriate questions. As far as I know, the K-medoids algorithm works like K-means clustering but uses actual data points as cluster centers instead of mathematically calculated means. As I googled…
Cassie
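To make the distinction in the K-medoids question above concrete, here is a minimal sketch of the simple alternating variant (assign points to the nearest medoid, then pick the member with the smallest total distance as the new medoid). It is not the full PAM algorithm, and the function name `k_medoids`, the toy data, and the seed are illustrative assumptions rather than anything taken from the question.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Naive alternating k-medoids; a sketch, not an optimized PAM."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances; any dissimilarity matrix could be used instead.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # Update step: the medoid is the member with the smallest
            # total distance to the other members (always a real data point).
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Made-up toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.0]])
medoids, labels = k_medoids(X, k=2)
print("medoid indices:", medoids, "labels:", labels)
```

Unlike k-means centroids, the returned medoids are indices into `X`, so each cluster center is an observed data point.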
13
votes
2 answers

Weka simple K-means clustering assignments

I have what feels like a simple problem, but I can't seem to find an answer. I'm pretty new to Weka, but I feel like I've done a bit of research on this (at least read through the first couple of pages of Google results) and come up dry. I am using…
machine yearning
13
votes
1 answer

How to find the success rate of a clustering algorithm?

I have implemented several clustering algorithms on an image dataset. I'm interested in deriving the success rate of clustering. I have to detect the tumor area; in the original image I know where the tumor is located, and I would like to compare the…
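One common way to quantify this kind of comparison is an overlap score between the detected region and the known region. The sketch below computes the Dice and Jaccard coefficients for two binary masks; the helper name `dice_and_jaccard` and the 6x6 masks are made up for illustration, and whether these scores suit the asker's data is an assumption.

```python
import numpy as np

def dice_and_jaccard(pred_mask, true_mask):
    """Overlap scores between a predicted and a ground-truth binary mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    # Two empty masks are treated as a perfect match.
    dice = 2.0 * intersection / total if total else 1.0
    jaccard = intersection / union if union else 1.0
    return dice, jaccard

# Hypothetical masks: 9 ground-truth pixels, 9 predicted pixels, 4 shared.
true_mask = np.zeros((6, 6), dtype=bool)
true_mask[1:4, 1:4] = True
pred_mask = np.zeros((6, 6), dtype=bool)
pred_mask[2:5, 2:5] = True

dice, jaccard = dice_and_jaccard(pred_mask, true_mask)
print(f"Dice = {dice:.3f}, Jaccard = {jaccard:.3f}")
# prints: Dice = 0.444, Jaccard = 0.286
```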
13
votes
3 answers

Clustering with a distance matrix

I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example,

     A    B    C    D    E    F    G    H    I    J    K    L
A    0   20   20   20   40   60   60   60  100  120  120  120
B   20    0   20   20   60   80   80   80  120  140  140…
yassin
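For the distance-matrix question above, one possible approach is hierarchical clustering on the precomputed distances with SciPy. This is a sketch under assumptions: the 5x5 matrix is invented (the asker's full matrix is truncated), and average linkage with a cut height of 50 is an arbitrary choice.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Illustrative symmetric distance matrix (made-up values, not the asker's data).
M = np.array([
    [0, 20, 20, 100, 120],
    [20, 0, 20, 120, 140],
    [20, 20, 0, 110, 130],
    [100, 120, 110, 0, 20],
    [120, 140, 130, 20, 0],
], dtype=float)

# linkage() expects the condensed (upper-triangle) form, not the square matrix.
condensed = squareform(M, checks=True)
Z = linkage(condensed, method="average")

# Cut the dendrogram at height 50 to obtain flat cluster labels.
labels = fcluster(Z, t=50, criterion="distance")
print(labels)
```

Passing the square matrix directly to `linkage` would treat each row as a feature vector, which is why the conversion with `squareform` matters here.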
13
votes
3 answers

Newman's modularity clustering for graphs

I am interested in running Newman's modularity clustering algorithm on a large graph. If you can point me to a library (or R package, etc) that implements it I would be most grateful. best ~lara
laramichaels
13
votes
1 answer

Variation on "How to plot decision boundary of a k-nearest neighbor classifier from Elements of Statistical Learning?"

This is a question related to https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o For completeness, here's the original example from that…
Rafael Santos
13
votes
2 answers

How to print result of clustering in sklearn

I have a sparse matrix
from scipy.sparse import *
M = csr_matrix((data_np, (rows_np, columns_np)));
then I'm doing clustering that way
from sklearn.cluster import KMeans
km = KMeans(n_clusters=n, init='random', max_iter=100, n_init=1,…
thepolina
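As a rough illustration for the sklearn question above: after fitting, `KMeans` exposes the assignment of each input row in its `labels_` attribute. The sparse matrix below is toy data standing in for the asker's `M`, and `random_state=0` is added only to make the sketch repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy sparse input standing in for the asker's matrix M (made-up values).
rows = np.array([0, 1, 2, 3, 4, 5])
cols = np.array([0, 0, 0, 1, 1, 1])
data = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
M = csr_matrix((data, (rows, cols)), shape=(6, 2))

km = KMeans(n_clusters=2, init="random", max_iter=100, n_init=1, random_state=0)
km.fit(M)

# labels_[i] is the cluster index assigned to row i of M.
for row_index, cluster in enumerate(km.labels_):
    print(f"row {row_index} -> cluster {cluster}")

# Group row indices by cluster for a more compact view.
clusters = {int(c): np.flatnonzero(km.labels_ == c).tolist()
            for c in np.unique(km.labels_)}
print(clusters)
```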
13
votes
3 answers

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news…
frnsys
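For the hyperparameter question above, one alternative to `GridSearchCV` is a manual sweep: enumerate settings with `ParameterGrid` and rank them by an internal validity score. The sketch below does this for DBSCAN using the silhouette score. The synthetic blobs stand in for the asker's text features, and the grid values and choice of silhouette are assumptions, not a recommendation for news data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in data; real text would first be vectorized (e.g. TF-IDF).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)

grid = ParameterGrid({"eps": [0.2, 0.5, 1.0, 2.0], "min_samples": [3, 5]})
results = []
for params in grid:
    labels = DBSCAN(**params).fit_predict(X)
    mask = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[mask]))
    # Silhouette needs at least two clusters; skip settings without them.
    if n_clusters < 2:
        continue
    score = silhouette_score(X[mask], labels[mask])
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

Scoring only the non-noise points can favor settings that discard many points as noise, so in practice the fraction of noise points is worth tracking alongside the score.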
13
votes
3 answers

Extract labels membership / classification from a cut dendrogram in R (i.e.: a cutree function for dendrogram)

I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclust object, but I can't figure out how to do it on a dendrogram object. Further, I can't just use my clusters…
Oreotrephes
13
votes
3 answers

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not…
Thalis K.
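A common workaround for the cosine-distance question above is to scale each vector to unit length and then run ordinary k-means: for unit vectors, squared Euclidean distance equals 2(1 - cosine similarity), so the two orderings agree. The sketch below assumes a made-up user-by-place duration matrix, and because the centroids are not re-normalized it only approximates spherical k-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical data: seconds spent by each of 4 users at each of 4 places.
durations = np.array([
    [3600, 1800, 0, 0],
    [7200, 3500, 0, 0],
    [0, 0, 600, 1200],
    [0, 60, 3000, 6100],
], dtype=float)

# Scale every row to unit L2 norm so only the direction of the vector matters.
unit = normalize(durations, norm="l2")

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(unit)
labels = km.labels_
print(labels)

# Check the identity ||a - b||^2 = 2 * (1 - cos(a, b)) for two unit vectors.
a, b = unit[0], unit[2]
print(np.isclose(np.sum((a - b) ** 2), 2 * (1 - a @ b)))
```

Users 0 and 1 differ by roughly a factor of two in total time but have almost the same direction, which is why normalization groups them together.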
13
votes
2 answers

Algorithm to decide cut-off for collapsing this tree?

I have a Newick tree that is built by comparing the similarity (Euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs that are 4-9 bp long DNA sequences. An interactive version of the tree is up on iTol…
hello_there_andy
13
votes
2 answers

What method do you use for selecting the optimum number of clusters in k-means and EM?

Many algorithms for clustering are available. A popular algorithm is K-means, where, based on a given number of clusters, the algorithm iterates to find the best clusters for the objects. What method do you use to determine the number of clusters in…
gd047
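One frequently used heuristic for the question above is to fit k-means for a range of k and keep the value with the highest mean silhouette score. The sketch below assumes synthetic data with four well-separated blobs; the candidate range 2 to 8 is arbitrary, and on real data the score curve is often much less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (an assumption for the demo).
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette: close to 1 means compact, well-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```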
13
votes
4 answers

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori…
astromax
13
votes
4 answers

Hierarchical Clustering: Determine optimal number of clusters and statistically describe clusters

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.…
Joschi