Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
42 votes, 3 answers

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The…
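
One way to read the idea in this question is to estimate a density over the 1D values and cut the data at the local minima of that density. The sketch below does this in plain Python rather than with scikit-learn; the function names, the fixed evaluation grid, and the Gaussian kernel are all assumptions made for illustration. Note that the number of clusters here follows from the bandwidth, so it does not directly honor a preset cluster count.

```python
import math

def gaussian_kde(data, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate of data at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
            for g in grid]

def kde_split_points(data, bandwidth, grid_size=200):
    """Return grid positions where the estimated density has an interior local minimum."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = gaussian_kde(data, bandwidth, grid)
    return [grid[i] for i in range(1, grid_size - 1)
            if density[i] < density[i - 1] and density[i] <= density[i + 1]]

def cluster_1d(data, bandwidth):
    """Split sorted data into runs separated by the density minima."""
    cuts = kde_split_points(data, bandwidth)
    clusters = [[] for _ in range(len(cuts) + 1)]
    for x in sorted(data):
        # The cluster index is the number of cut points lying below x.
        index = sum(1 for c in cuts if x > c)
        clusters[index].append(x)
    return clusters

clusters = cluster_1d([1, 2, 3, 10, 11, 12, 20, 21, 22], bandwidth=1.5)
```

A smaller bandwidth tends to produce more cuts and a larger one fewer, so in practice the bandwidth would need tuning to reach a desired number of groups.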
41 votes, 3 answers

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, that is, the initialization, since the rest is the same as in the original K-Means algorithm. Is the probability function used based…
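
As a rough illustration of the initialization being asked about, the sketch below follows the usual description of k-means++ seeding: the first centroid is drawn uniformly, and each later centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. The function names and the `seed` parameter are assumptions for this example, not part of any particular library.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids from a list of points using D^2-weighted sampling."""
    rng = random.Random(seed)
    # The first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(points))
            continue
        # Sample index i with probability d2[i] / total via a cumulative scan.
        threshold = rng.random() * total
        cumulative = 0.0
        # Fallback guards against floating-point round-off at the upper end.
        chosen = max(range(len(points)), key=lambda i: d2[i])
        for i, weight in enumerate(d2):
            cumulative += weight
            if cumulative > threshold:
                chosen = i
                break
        centroids.append(points[chosen])
        # Refresh nearest-centroid distances to account for the new centroid.
        d2 = [min(old, squared_distance(p, points[chosen]))
              for old, p in zip(d2, points)]
    return centroids

seeds = kmeans_pp_init([(0, 0), (0, 1), (10, 10), (10, 11)], k=2, seed=0)
```

Because an already chosen point has weight zero, it cannot be drawn again unless all remaining weights are zero. After seeding, the ordinary k-means iterations would run unchanged.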
41 votes, 3 answers

Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine. My problem here is that I don't need to…
Jamie Bull
40 votes, 2 answers

Calculating the percentage of variance measure for k-means?

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated. More…
Legend
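
For the elbow method mentioned in this question, one common formulation of "percentage of variance explained" is one minus the ratio of within-cluster sum of squares to total sum of squares. The sketch below computes that ratio in plain Python for a given labeling; the function names are hypothetical, and definitions of "distortion" vary between libraries, so the value may not match what scipy reports.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in points)

def variance_explained(points, labels):
    """Return 1 - (within-cluster SS / total SS) for the given cluster labels.

    Assumes the points are not all identical, so the total SS is nonzero.
    """
    total_ss = sum_of_squares(points, mean_point(points))
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    within_ss = sum(sum_of_squares(members, mean_point(members))
                    for members in clusters.values())
    return 1.0 - within_ss / total_ss

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
ratio = variance_explained(points, [0, 0, 1, 1])
```

Plotting this ratio against the number of clusters and looking for the point where it stops rising quickly is the usual elbow heuristic.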
38 votes, 5 answers

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line…
JamesT
38 votes, 6 answers

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as…
Belinda Chiera
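
A commonly cited heuristic for choosing eps is to sort every point's distance to its k-th nearest neighbor (with k tied to minPts) and look for a knee in the resulting curve. The sketch below computes those sorted distances by brute force in Python rather than R; the function name is hypothetical and the quadratic cost makes it suitable only for small data sets.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

curve = k_distances([(0, 0), (0, 1), (1, 0), (10, 10)], k=1)
```

In this toy input the last value jumps sharply, which is the kind of knee the heuristic looks for: an eps just below the jump would leave the distant point as noise.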
37 votes, 2 answers

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas DataFrame. This is sklearn.cluster.KMeans: km = KMeans(n_clusters=n_Clusters); km.fit(dataset); prediction = km.predict(dataset). This is how I decide which entity belongs to which cluster: for i in range(len(prediction)): …
Dark Knight
37 votes, 5 answers

sklearn agglomerative clustering linkage matrix

I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering. However, sklearn.AgglomerativeClustering doesn't return the distance between…
Presian Abarov
37 votes, 4 answers

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on "How does clustering (especially String clustering) work?" informed me that Levenshtein distance is good to be used as a…
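
To make the distance in this question concrete, the sketch below implements Levenshtein distance with the standard dynamic program and then groups strings with a simple greedy threshold rule. The grouping rule and the names `group_by_threshold` and `max_distance` are assumptions chosen for brevity; the result depends on input order and is not a substitute for a proper clustering algorithm over the distance matrix.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def group_by_threshold(words, max_distance):
    """Greedy grouping: a word joins the first group whose first member is close enough."""
    groups = []
    for word in words:
        for group in groups:
            if levenshtein(word, group[0]) <= max_distance:
                group.append(word)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([word])
    return groups

groups = group_by_threshold(["house", "horse", "mouse", "cat", "cot"], max_distance=1)
```

For a few thousand short strings, the full pairwise distance matrix is small enough to compute and hand to a hierarchical or density-based method instead.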
36 votes, 4 answers

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word with some differences (e.g.: house, house!!,…
Renato Dinhani
36 votes, 3 answers

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean…
tumultous_rooster
35 votes, 6 answers

How to group latitude/longitude points that are 'close' to each other?

I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet. At first it seemed I could just group by rows that have the same latitude/longitude…
Tim Lytle
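
One simple way to approach the grouping described here is to measure great-circle distance with the haversine formula and assign each point to the first group whose seed point lies within the radius. The sketch below assumes a spherical Earth and a greedy, order-dependent grouping rule; the function names are hypothetical, and 500 feet is taken as roughly 152.4 meters.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 so round-off cannot push the value outside asin's domain.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))

def group_close_points(points, radius_m):
    """Greedy grouping: each point joins the first group whose seed is within radius_m."""
    groups = []
    for lat, lon in points:
        for group in groups:
            seed_lat, seed_lon = group[0]
            if haversine_m(lat, lon, seed_lat, seed_lon) <= radius_m:
                group.append((lat, lon))
                break
        else:
            groups.append([(lat, lon)])
    return groups

groups = group_close_points(
    [(40.0, -75.0), (40.0005, -75.0), (41.0, -75.0)], radius_m=152.4)
```

Comparing every point with every group seed is quadratic in the worst case, so a large table would likely need a spatial index or a grid prefilter first.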
35 votes, 2 answers

Extracting clusters from seaborn clustermap

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results). However I am having trouble figuring out how to programmatically extract the clusters. For instance, in the example link,…
sedavidw
34 votes, 1 answer

Cluster one-dimensional data optimally?

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension?
Laciel
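
In one dimension, an optimal k-means partition consists of contiguous runs of the sorted values, which allows an exact solution by dynamic programming. The sketch below is a straightforward O(k·n²) version of that idea written for illustration; it is not the Ckmeans.1d.dp implementation, and the function name is hypothetical.

```python
def optimal_1d_kmeans(values, k):
    """Return (cost, clusters) minimizing within-cluster sum of squares for 1D data."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Prefix sums give the cost of any contiguous segment in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def segment_cost(i, j):
        """Sum of squared deviations from the mean for xs[i:j], with j > i."""
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return sq - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting the first j sorted values into m clusters.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = cost[m - 1][i] + segment_cost(i, j)
                if candidate < cost[m][j]:
                    cost[m][j] = candidate
                    split[m][j] = i
    # Walk the split table backwards to recover the clusters.
    clusters = []
    j = n
    for m in range(k, 0, -1):
        i = split[m][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return cost[k][n], clusters

best_cost, best_clusters = optimal_1d_kmeans([1, 2, 3, 10, 11, 12, 20, 21], k=3)
```

The prefix-sum cost formula can lose precision for values with a large common offset, so centering the data first may be worthwhile.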
34 votes, 17 answers

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria. Imagine you are managing newspaper delivery persons. You have a set of street addresses, each of which is geocoded. You want to cluster the addresses so that…
carrier