Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering and hierarchical clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
9 votes, 1 answer

Dendrogram or Other Plot from Distance Matrix

I have three matrices to compare. Each of them is 5x6. I originally wanted to use hierarchical clustering to cluster the matrices, such that the most similar matrices are grouped, given a threshold of similarity. I could not find any such functions…
amc
9 votes, 3 answers

Sklearn : Mean Distance from Centroid of each cluster

How can I find the mean distance from the centroid to all the data points in each cluster? I am able to find the Euclidean distance of each point (in my dataset) from the centroid of each cluster. Now I want to find the mean distance from centroid…
Rezwan
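
One way to read the question above is as a per-cluster average of point-to-centroid distances. The following is a minimal, library-agnostic sketch in plain Python rather than the asker's code; the function name `mean_distance_to_centroid` and the toy data are hypothetical. With scikit-learn, the labels and centroids would typically come from a fitted `KMeans` model's `labels_` and `cluster_centers_` attributes.

```python
import math
from collections import defaultdict


def mean_distance_to_centroid(points, labels, centroids):
    """Return {cluster label: mean Euclidean distance of its points to its centroid}."""
    distances = defaultdict(list)
    for point, label in zip(points, labels):
        # math.dist computes the Euclidean distance between two points (Python 3.8+).
        distances[label].append(math.dist(point, centroids[label]))
    return {label: sum(d) / len(d) for label, d in distances.items()}


# Hypothetical toy data: two clusters in 2D, with known labels and centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (14.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (12.0, 10.0)]

print(mean_distance_to_centroid(points, labels, centroids))
```

Grouping by label first keeps the function independent of how the clustering was produced, so the same helper works for any algorithm that yields labels and centroids.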
9 votes, 1 answer

How to do clustering using the matrix of correlation coefficients?

I have a correlation coefficient matrix (n*n). How can I do clustering using the correlation coefficient matrix? Can I use the linkage and fcluster functions in SciPy? The linkage function needs an n * m matrix (according to the tutorial), but I want to use n*n…
Siny
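
A possible approach to the question above, sketched under assumptions: convert correlations to distances, condense the square matrix, and pass the result to SciPy's `linkage` and `fcluster`. The choice of `1 - r` as the distance (rather than, say, `1 - |r|`), the average linkage method, the cut threshold of 0.5, and the toy matrix are all illustrative assumptions, not taken from the original post.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical 4x4 correlation matrix: items 0 and 1 correlate strongly, as do 2 and 3.
corr = np.array([
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
])

# Assumption: treat highly correlated items as close, so distance = 1 - correlation.
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)  # guard against rounding noise on the diagonal

# linkage accepts a condensed (upper-triangle) distance vector, not the square matrix.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="average")

# Cut the dendrogram at distance 0.5 to obtain flat cluster labels.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

Passing the square n*n matrix straight to `linkage` would make SciPy treat each row as an observation vector, which is why the condensed form is used here.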
9 votes, 1 answer

Plotting the boundaries of cluster zone in Python with scikit package

Here is my simple example of clustering data with 3 attributes (x, y, value). Each sample represents its location (x, y) and its associated variable. My code is posted here: x = np.arange(100,200,1) y = np.arange(100,200,1) value =…
Han Zhengzu
9 votes, 1 answer

Sklearn AffinityPropagation MemoryError

I think I already know my answer, but there are a lot of smarter and more experienced people out there than me, so I wanted to ask. I'm running into a MemoryError when trying to fit my hash_matrix () to AffinityPropagation. …
Jarad
9 votes, 2 answers

Estimate the minimum Distance between two Clusters

I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between the two clusters. The exact distance…
Paul Chernoch
9 votes, 2 answers

Efficient algorithm to group points in clusters by distance between every two points

I am looking for an efficient algorithm for the following problem: Given a set of points in 2D space, where each point is defined by its X and Y coordinates, split this set of points into a set of clusters so that if the distance between two…
ovk
9 votes, 1 answer

User profiling with Mahout from categorized user behavior

I'm trying to cluster and classify users with Mahout. At the moment I am at the planning phase, my mind is completely mixed with ideas, and since I'm relatively new to the area I'm stuck at the data formatting. Let's say we have two data tables (big…
Turcia
9 votes, 3 answers

Clustering a large, very sparse, binary matrix in R

I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete: d <- dist(inputMatrix, method="binary") hc <-…
Matt LaFave
9 votes, 4 answers

k-means clustering in R on very large, sparse matrix?

I am trying to do some k-means clustering on a very large matrix. The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row). The whole thing does not fit into memory, so I converted it into a sparse…
movingabout
9 votes, 3 answers

Algorithm for clustering with minimum size constraints

I have a set of data clustered into k groups; each cluster has a minimum size constraint of m. I've done some reclustering of the data. So now I have a set of points, each of which has one or more better clusters to be in, but cannot be switched…
qshng
9 votes, 1 answer

K means clustering for multidimensional data

If the data set has 440 objects and 8 attributes (the dataset was taken from the UCI Machine Learning Repository), how do we calculate centroids for such a dataset? (wholesale customers…
Suvidha
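
For the question above, a minimal sketch assuming standard k-means: the centroid of a cluster is the component-wise mean of its members, so with 8 attributes each centroid simply has 8 coordinates. The helper name `centroid` and the sample rows are hypothetical, and three attributes are used only to keep the example short.

```python
def centroid(rows):
    """Return the per-attribute mean of a non-empty list of equal-length numeric rows."""
    n = len(rows)
    # zip(*rows) regroups the data by attribute (column) instead of by object (row).
    return [sum(column) / n for column in zip(*rows)]


# Hypothetical cluster of two objects with three attributes each.
cluster = [
    [2.0, 4.0, 6.0],
    [4.0, 8.0, 10.0],
]

print(centroid(cluster))  # [3.0, 6.0, 8.0]
```

The same function works unchanged for any number of attributes, since it never refers to the dimensionality explicitly.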
9 votes, 4 answers

Python KMeans clustering words

I am interested in performing k-means clustering on a list of words with the distance measure being Levenshtein. 1) I know there are a lot of frameworks out there, including SciPy and Orange, that have a k-means implementation. However they all require…
sadawd
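
The distance measure named in the question above can be computed with the usual dynamic-programming recurrence. The sketch below is an illustration, not the asker's code. Because strings have no arithmetic mean, it also shows a hypothetical `medoid` helper: picking the member with the smallest total distance to the others is the building block of k-medoids, a common substitute for k-means when only pairwise distances are available.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def medoid(words):
    """Return the word with the smallest total Levenshtein distance to all others."""
    return min(words, key=lambda w: sum(levenshtein(w, other) for other in words))


print(levenshtein("kitten", "sitting"))
print(medoid(["cat", "cut", "cot", "dog"]))
```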
9 votes, 3 answers

Clustering algorithm in R for missing categorical and numerical values

I want to perform marketing segmentation clustering on a dataset with missing categorical and numerical values in R. I cannot perform k-means clustering because of the missing values. R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin13.1.0…
Scott Davis
9 votes, 1 answer

How to add ColSideColors on heatmap.2 after performing bi-clustering (row and column)

I have the following code: library(gplots) library(RColorBrewer); setwd("~/Desktop") mydata <- mtcars hclustfunc <- function(x) hclust(x, method="complete") distfunc <- function(x) dist(x,method="euclidean") d <- distfunc(mydata) fit <-…
pdubois