Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and it is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
34 votes, 3 answers

Spectral Clustering a graph in python

I'd like to cluster a graph in Python using spectral clustering. Spectral clustering is a more general technique which can be applied not only to graphs but also to images or any sort of data; however, it's considered an exceptional graph clustering…
Alex Lenail
33 votes, 4 answers

Clustering Algorithm for Mapping Application

I'm looking into clustering points on a map (latitude/longitude). Are there any recommendations as to a suitable algorithm that is fast and scalable? More specifically, I have a series of latitude/longitude coordinates and a map viewport. I'm trying…
33 votes, 6 answers

Which machine learning library to use

I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix); implements support vector machines; is in C++; is somewhat documented (this…
Björn Pollex
33 votes, 2 answers

Reordering matrix elements to reflect column and row clustering in naive python

I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reorder the data in the matrix to reflect the clustering, and put it all together. The clustering problem is easily solvable, so is the dendrogram…
Boris Gorelik
32 votes, 7 answers

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs). I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster…
32 votes, 5 answers

Scikit Learn GridSearchCV without cross validation (unsupervised learning)

Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation. The documentation is also confusing me because under…
32 votes, 5 answers

DBSCAN for clustering of geographic location data

I have a dataframe with latitude and longitude pairs. Here is what my dataframe looks like:

       order_lat  order_long
    0  19.111841   72.910729
    1  19.111342   72.908387
    2  19.111342   72.908387
    3  19.137815   72.914085
    4  19.119677   72.905081
    5  …
Neil
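Latitude/longitude pairs like the ones in this excerpt are angles, so a density-based method generally needs a great-circle distance rather than plain Euclidean distance on the raw degrees. The following is a minimal sketch of the haversine formula in plain Python, not the approach from the question or its answers. The function name `haversine_km` and the Earth radius constant are assumptions made for illustration; the sample points are the first three rows shown in the excerpt.

```python
import math

# Assumed constant: mean Earth radius in kilometres.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# First three rows of the dataframe shown in the excerpt (rows 1 and 2 are identical).
points = [(19.111841, 72.910729), (19.111342, 72.908387), (19.111342, 72.908387)]
d01 = haversine_km(*points[0], *points[1])
d12 = haversine_km(*points[1], *points[2])
print(f"row 0 to row 1: {d01:.3f} km, row 1 to row 2: {d12:.3f} km")
```

A neighbourhood radius for a density-based algorithm could then be expressed in kilometres instead of degrees, which keeps its meaning independent of latitude.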
31 votes, 2 answers

python scikit-learn clustering with missing data

I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column. With scikit-learn, missing data is not possible. There is also no chance to specify a user distance…
Michael Hecht
31 votes, 4 answers

What is the difference between the "k means" and "fuzzy c means" objective functions?

I am trying to see whether the performance of both can be compared based on the objective functions they work on.
n0ob
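For reference, the two objective functions in their standard textbook forms can be written side by side. The notation is an assumption introduced here, not taken from the question: $x_i$ are the $n$ data points, $c_j$ the $k$ cluster centers, $r_{ij}$ hard assignments, $u_{ij}$ fuzzy memberships, and $m$ the fuzzifier.

```latex
% k-means: each point belongs to exactly one cluster
J_{\text{k-means}} = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \,\lVert x_i - c_j \rVert^2,
\qquad r_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{k} r_{ij} = 1

% fuzzy c-means: each point has a graded membership in every cluster
J_{m} = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{\,m} \,\lVert x_i - c_j \rVert^2,
\qquad u_{ij} \in [0, 1], \quad \sum_{j=1}^{k} u_{ij} = 1, \quad m > 1
```

The two differ only in the weights: hard 0/1 assignments versus memberships raised to the power $m$. Because the weights differ, the raw objective values are not on a directly comparable scale.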
31 votes, 14 answers

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: Any human looking at this would be able to tell me that this helicopter is based out of Chicago. How can I find the same result in…
Ryan
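One possible way to make "the center" concrete, offered as a sketch and not as the answer given on the page, is the medoid: the observed point whose total distance to all other points is smallest. Unlike the arithmetic mean, it is not dragged far off by a few distant trips. The function name `medoid` and the sample coordinates are hypothetical, and plain Euclidean distance on planar coordinates is assumed.

```python
import math


def medoid(points):
    """Return the input point with the smallest summed Euclidean distance to all points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical positions: four sightings near a base and one distant trip.
positions = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (9.0, 8.0)]

base = medoid(positions)
mean = tuple(sum(coord) / len(positions) for coord in zip(*positions))
print("medoid:", base)
print("mean:  ", mean)
```

This brute-force version compares every pair of points, so it scales quadratically with the number of positions; for a year of daily observations that is still small.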
30 votes, 1 answer

Online k-means clustering

Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with…
Theodor
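The idea described in this excerpt, processing one point at a time, matches what is usually called sequential (MacQueen-style) k-means. Below is a minimal sketch under stated assumptions: the class name `OnlineKMeans` and its `partial_fit` method are hypothetical, the first `k` points seed the centroids, and each later point moves its nearest centroid by a step of 1/count, which keeps every centroid equal to the running mean of the points assigned to it.

```python
class OnlineKMeans:
    """Sequential k-means: a single pass over the data, one point at a time."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one list of floats per cluster
        self.counts = []     # number of points assigned to each centroid so far

    def partial_fit(self, point):
        """Assign one point to a cluster, update that centroid, and return the cluster index."""
        point = [float(v) for v in point]
        # Assumption: the first k points are used as the initial centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(point)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Nearest centroid by squared Euclidean distance.
        j = min(
            range(self.k),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, self.centroids[i])),
        )
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        # Incremental mean update: c <- c + (x - c) / n
        self.centroids[j] = [c + step * (p - c) for p, c in zip(point, self.centroids[j])]
        return j


model = OnlineKMeans(k=2)
stream = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0), (2.0, 0.0)]
labels = [model.partial_fit(p) for p in stream]
print(labels)
print(model.centroids)
```

Each update costs O(k) distance computations and needs no stored history, but the result depends on the order of arrival and on the seeding choice.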
30 votes, 1 answer

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots' heatmap.2. The appropriate results depend on the analysis but I'm trying to understand why the defaults are so different, and how to…
user248237
28 votes, 5 answers

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in…
Jonathan Rhein
28 votes, 1 answer

How to compute cluster assignments from linkage/distance matrices

If you have this hierarchical clustering call in scipy in Python:

    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform
    # dist_matrix is long form distance matrix
    linkage_matrix = linkage(squareform(dist_matrix), linkage_method)

then what's an efficient way…
user248237
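One way to turn a linkage matrix into flat cluster labels is SciPy's `fcluster`. The sketch below is an illustration under assumptions, not the accepted answer: the 4x4 distance matrix is hypothetical, `"average"` stands in for the unspecified `linkage_method`, and the tree is cut with the `maxclust` criterion so that at most two clusters remain.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix: items 0 and 1 are close, as are items 2 and 3.
dist_matrix = np.array([
    [0.0, 1.0, 9.0, 9.5],
    [1.0, 0.0, 8.5, 9.0],
    [9.0, 8.5, 0.0, 1.2],
    [9.5, 9.0, 1.2, 0.0],
])

# linkage expects the condensed (1-D) form, which squareform produces from a square matrix.
linkage_matrix = linkage(squareform(dist_matrix), method="average")

# Cut the dendrogram so that no more than 2 flat clusters are formed.
labels = fcluster(linkage_matrix, t=2, criterion="maxclust")
print(labels)
```

`fcluster` also accepts `criterion="distance"`, in which case `t` is read as a cophenetic distance threshold instead of a cluster count.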
27 votes, 1 answer

Clustering (fkmeans) with Mahout using Clojure

I am trying to write a short script to cluster my data via Clojure (calling Mahout classes though). I have my input data in this format (which is the output of a PHP script): (tag) (image) (frequency) tag_sit image_a 0 tag_sit image_b…
Jeffrey04