Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that objects in the same cluster are similar in some sense, while objects in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering, and hierarchical clustering.

Related topics: classification, pattern recognition, knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions


3 answers

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The…

machine-learning scikit-learn cluster-analysis data-mining kernel-density

asked Jan 29 '16 at 21:35

Alex Kinman
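
One way to approach the 1D question above, sketched in plain Python rather than scikit-learn: estimate a Gaussian kernel density on a grid and cut the data at the local minima of that density. The function names, the fixed `bandwidth`, and the grid size are assumptions made for illustration. Note that this approach lets the density decide how many clusters there are, so reaching a preset number would mean tuning the bandwidth.

```python
import bisect
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `data` at x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def kde_split_points(data, bandwidth, grid_size=200):
    """Locate local minima of the KDE on an evenly spaced grid."""
    density = gaussian_kde(data, bandwidth)
    lo, hi = min(data), max(data)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    dens = [density(x) for x in grid]
    # A grid point is a split if the density drops into it and does not
    # keep dropping afterwards (<= handles flat valleys).
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]


def assign_clusters(data, splits):
    """Label each value by the interval between split points it falls into."""
    return [bisect.bisect_right(splits, x) for x in data]


if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.3, 9.1]
    splits = kde_split_points(data, bandwidth=0.5)
    print(splits, assign_clusters(data, splits))
```

A smaller bandwidth tends to produce more minima and therefore more clusters; a larger one merges neighbouring groups.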


3 answers

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is the same as in the original K-Means algorithm. Is the probability function used based…

algorithm language-agnostic machine-learning cluster-analysis k-means

asked Mar 28 '11 at 23:45

Anton Andreev
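
As a rough illustration of the initialization this question asks about, here is a minimal plain-Python sketch of k-means++ seeding: the first centroid is drawn uniformly, and each later one is drawn with probability proportional to the squared distance to the nearest centroid already chosen. The function names and the optional `rng` argument are assumptions for this sketch, not part of any particular library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids from `points` using D^2 weighting."""
    rng = rng or random.Random()
    # Step 1: the first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a centroid; fall back to uniform.
            centroids.append(rng.choice(points))
            continue
        # Step 2: draw the next centroid with probability d2[i] / total.
        threshold = rng.random() * total
        cumulative = 0.0
        # Fallback (farthest point) guards against floating-point shortfall.
        chosen = max(range(len(points)), key=d2.__getitem__)
        for i, weight in enumerate(d2):
            cumulative += weight
            if cumulative > threshold:
                chosen = i
                break
        centroids.append(points[chosen])
        # Step 3: refresh the nearest-centroid distances.
        d2 = [min(old, squared_distance(p, points[chosen])) for old, p in zip(d2, points)]
    return centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans_pp_init(pts, 2, random.Random(0)))
```

After seeding, the usual assign-and-update iterations of k-means proceed unchanged.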


3 answers

Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which works fine. My problem here is that I don't need to…

python scikit-learn cluster-analysis scoring

asked Jan 05 '16 at 11:49

Jamie Bull


2 answers

Calculating the percentage of variance measure for k-means?

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated. More…

python numpy statistics cluster-analysis k-means

asked Jul 11 '11 at 04:55

Legend
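
For readers unsure what a "percentage of variance" measure looks like in practice, here is a small plain-Python sketch under one common definition: the fraction of the total sum of squares that is not left inside the clusters. The helper names are hypothetical, and this definition may differ from the value scipy reports as distortion. It assumes the points are not all identical, since the total sum of squares would then be zero.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def explained_variance(points, labels):
    """Return 1 - (within-cluster SS / total SS) for a given labelling."""
    total = sum_of_squares(points, centroid(points))
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    within = sum(
        sum_of_squares(members, centroid(members)) for members in clusters.values()
    )
    return 1.0 - within / total


if __name__ == "__main__":
    pts = [(0.0,), (2.0,), (10.0,), (12.0,)]
    print(explained_variance(pts, [0, 0, 1, 1]))
```

Plotting this value against k and looking for the point where gains flatten out is the elbow heuristic the question refers to.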


5 answers

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line…

python scikit-learn cluster-analysis data-mining dbscan

asked May 05 '13 at 05:04

JamesT


6 answers

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer to this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as…

r data-mining cluster-analysis dbscan

asked Oct 15 '12 at 10:12

Belinda Chiera
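
A commonly cited heuristic for choosing eps is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbour, sort the values, and look for a knee. The sketch below is plain Python (not R or fpc), uses a brute-force O(n²) search, and the function name is hypothetical. Taking k close to minPts is an assumption of the heuristic, not a rule.

```python
import math


def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    # The jump at the end of the sorted list hints at where noise begins.
    print(k_distances(pts, 2))
```

An eps just below the sharp rise in the sorted values would treat the points after the rise as noise.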


2 answers

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas DataFrame. This is sklearn.cluster.KMeans: km = KMeans(n_clusters = n_Clusters); km.fit(dataset); prediction = km.predict(dataset). This is how I decide which entity belongs to which cluster: for i in range(len(prediction)): …

python pandas scikit-learn cluster-analysis k-means

asked Jan 19 '15 at 02:17

Dark Knight


5 answers

sklearn agglomerative clustering linkage matrix

I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering. However, sklearn.AgglomerativeClustering doesn't return the distance between…

python scikit-learn cluster-analysis dendrogram

asked Nov 10 '14 at 19:33

Presian Abarov


4 answers

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on "How does clustering (especially String clustering) work?" informed me that Levenshtein distance is good to use as a…

r matlab cluster-analysis levenshtein-distance hierarchical-clustering

asked Feb 02 '14 at 14:38

Alexandros
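
Since this question hinges on Levenshtein distance, here is a minimal plain-Python sketch of the standard dynamic-programming computation, plus a pairwise distance matrix of the kind a hierarchical clustering routine could consume. It is written in Python rather than R or MATLAB, and the function names are chosen for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def distance_matrix(words):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(words)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(words[i], words[j])
    return matrix


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(distance_matrix(["ab", "abc", "xyz"]))
```

For a few thousand short strings, the quadratic number of pairs is usually still manageable.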


4 answers

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word with some differences (e.g., house, house!!,…

string cluster-analysis data-mining

asked Nov 19 '11 at 18:48

Renato Dinhani


3 answers

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean…

machine-learning cluster-analysis data-mining k-means

asked Feb 07 '14 at 05:08

tumultous_rooster


6 answers

How to group latitude/longitude points that are 'close' to each other?

I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet. At first it seemed I could just group by rows that have the same latitude/longitude…

sql database geolocation location cluster-analysis

asked Dec 03 '10 at 19:28

Tim Lytle
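
To make the idea of grouping 'close' points concrete, here is a plain-Python sketch (not SQL) that measures great-circle distance with the haversine formula and merges any two points within a threshold into the same group using union-find. The function names, the mean Earth radius, and the brute-force O(n²) pairing are assumptions for illustration; groups formed this way can chain, so two members of one group may be farther apart than the threshold.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing `a` slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def group_close_points(points, max_dist_m):
    """Group point indices so that linked points are within max_dist_m."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(*points[i], *points[j]) <= max_dist_m:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    pts = [(40.0, -75.0), (40.0005, -75.0), (41.0, -75.0)]
    print(group_close_points(pts, 152.4))  # 500 feet is 152.4 metres
```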


2 answers

Extracting clusters from seaborn clustermap

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results). However I am having trouble figuring out how to programmatically extract the clusters. For instance, in the example link,…

python cluster-analysis hierarchical-clustering seaborn dendrogram

asked Jan 13 '15 at 14:48

sedavidw


1 answer

Cluster one-dimensional data optimally?

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension?

r cluster-analysis k-means

asked Oct 23 '11 at 22:12

Laciel
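
In one dimension, an optimal k-means solution always consists of contiguous runs of the sorted values, which is what makes dynamic programming applicable. The sketch below is a simplified O(k·n²) version of that idea in plain Python (not R); it is not the Ckmeans.1d.dp package's actual implementation, and the function name is hypothetical.

```python
def optimal_1d_kmeans(values, k):
    """Split sorted values into k contiguous clusters minimising total SSE."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums give the SSE of any run xs[i:j] in constant time.
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # dp[m][j]: best cost of splitting the first j values into m clusters.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = dp[m - 1][i] + cost(i, j)
                if candidate < dp[m][j]:
                    dp[m][j] = candidate
                    cut[m][j] = i

    # Walk the stored cut positions backwards to recover the clusters.
    clusters, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return clusters, dp[k][n]


if __name__ == "__main__":
    print(optimal_1d_kmeans([1, 2, 3, 10, 11, 12, 50], 3))
```

Unlike Lloyd-style iterations, this search does not depend on an initial guess, because it considers every contiguous split.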


17 answers

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria. Imagine you are managing newspaper delivery persons. You have a set of street addresses, each of which is geocoded. You want to cluster the addresses so that…

algorithm language-agnostic cluster-analysis

asked Feb 18 '09 at 21:25

carrier
