Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering and hierarchical clustering.
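
As an illustration of grouping unlabeled points so that members of a cluster are close to each other, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm), one widely used clustering method. The function name `kmeans` and the parameters `iterations` and `seed` are assumptions made for this example, not part of any library, and the sketch ignores refinements such as smarter initialization or convergence checks.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Illustrative k-means: returns (centroids, labels).

    points: list of equal-length numeric tuples
    k: number of clusters (hypothetical parameter name)
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [
        min(range(k), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points should give every point in a group the same label; which group receives label 0 depends on the random initialization.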

Related topics: classification, pattern recognition, knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
26 votes, 6 answers

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a specified radius). That means that there probably has…
26 votes, 1 answer

Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a list of documents as shown below: documents =…
26 votes, 3 answers

Understanding concept of Gaussian Mixture Models

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means. Here is what I have understood, please let me know if my concept is wrong: GMM is like…
26 votes, 2 answers

Estimation of number of Clusters via gap statistics and prediction strength

I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into Python scripts for estimating the number of clusters in the iris data, which has 3 clusters. Instead…
Riyaz (reputation 1,430; badges 2, 17, 27)
26 votes, 3 answers

Clustering values by their proximity in python (machine learning?)

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set. The sorted output is something like…
PCoelho (reputation 7,850; badges 11, 31, 36)
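
For one-dimensional scores like those described in the question above, one simple approach (a sketch only, not necessarily what the asker or the answers used) is to sort the values and start a new group whenever the gap to the previous value exceeds a threshold. The function name `group_by_gap` and the parameter `max_gap` are hypothetical, and the sample values in the usage note are made up for illustration.

```python
def group_by_gap(values, max_gap):
    """Split numbers into groups of neighbours that are at most max_gap apart.

    values: iterable of numbers (any order)
    max_gap: largest allowed difference between consecutive sorted values
             inside one group (hypothetical tuning parameter)
    """
    ordered = sorted(values)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] > max_gap:
            groups.append([value])    # gap too large: open a new group
        else:
            groups[-1].append(value)  # close enough: extend the current group
    return groups
```

For example, `group_by_gap([1, 2, 3, 10, 11, 30], 2)` splits the values wherever two sorted neighbours differ by more than 2. Choosing `max_gap` is the hard part and depends on the data.
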
25 votes, 2 answers

Hierarchical clustering of 1 million objects

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but…
25 votes, 2 answers

Group n points in k clusters of equal size

Possible Duplicate: K-means algorithm variation with equal cluster size. EDIT: As casperOne pointed out to me, this question is a duplicate. Anyway, here is a more general question that covers this one:…
Pierre-David Belanger (reputation 1,004; badges 1, 11, 19)
24 votes, 5 answers

Distributed hierarchical clustering

Are there any algorithms that can help with hierarchical clustering? Google's map-reduce has only an example of k-clustering. In case of hierarchical clustering, I'm not sure how it's possible to divide the work between nodes. Other resource that I…
Roman (reputation 13,100; badges 2, 47, 63)
24 votes, 4 answers

Changes of clustering results after each time run in Python scikit-learn

I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and got the results with no problem. But every time I run it I get different results. I know this is a problem with initialization but I…
user3430235 (reputation 419; badges 1, 4, 12)
24 votes, 3 answers

Clustering text in Python

I need to cluster some text documents and have been researching various options. It looks like LingPipe can cluster plain text without prior conversion (to vector space etc), but it's the only tool I've seen that explicitly claims to work on…
Dan (reputation 1,677; badges 5, 19, 34)
23 votes, 2 answers

What is the difference between a Confusion Matrix and Contingency Table?

I'm writing a piece of code to evaluate my clustering algorithm and I find that every kind of evaluation method needs the basic data from an m*n matrix like A = {aij} where aij is the number of data points that are members of class ci and elements…
MangMang (reputation 427; badges 1, 5, 17)
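
The m*n matrix described in the question above can be sketched as a count table. The excerpt is cut off, so this example assumes that aij counts the points whose true class is ci and whose assigned cluster is the j-th cluster. The function name `contingency_table` and the sample labels in the usage note are hypothetical.

```python
from collections import Counter


def contingency_table(classes, clusters):
    """Count how many points fall into each (class, cluster) pair.

    classes: true class label of each point
    clusters: cluster label assigned to each point (same order)
    Returns (row_labels, column_labels, table), where table[i][j] is the
    number of points with class row_labels[i] and cluster column_labels[j].
    """
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")
    counts = Counter(zip(classes, clusters))
    row_labels = sorted(set(classes))
    column_labels = sorted(set(clusters))
    # Counter returns 0 for pairs that never occur together.
    table = [[counts[(r, c)] for c in column_labels] for r in row_labels]
    return row_labels, column_labels, table
```

For instance, `contingency_table(["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1])` yields one row per class and one column per cluster. Unlike a confusion matrix for a classifier, the columns here have no fixed correspondence to the rows, because cluster labels are arbitrary.
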
23 votes, 6 answers

Merge related words in NLP

I'd like to define a new word which includes count values from two (or more) different words. For example:

        Words      Frequency
    0   mom        250
    1   2020       151
    2   the        124
    3   19         82
    4   mother     81
    ... ...        ...
    10  London     6
    11  life       6
    12  something  6

I…
user13623188
22 votes, 2 answers

How does pytorch backprop through argmax?

I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid. To identify which centroid is nearest to each point, I use…
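
The loss described in the question above (the sum of squared distances from each point to its nearest centroid) can be written out in plain Python to make the role of the nearest-centroid selection explicit. This is only a sketch of the loss itself, not of the asker's PyTorch code; the function name `kmeans_loss` is hypothetical.

```python
def kmeans_loss(points, centroids):
    """Return (loss, assignments) for the k-means objective.

    loss: sum over points of the squared distance to the nearest centroid
    assignments: index of the nearest centroid for each point
    """
    loss = 0.0
    assignments = []
    for p in points:
        # Squared Euclidean distance from this point to every centroid.
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        # The arg-min is a discrete index; only the selected distance
        # contributes a value to the loss.
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        assignments.append(nearest)
        loss += distances[nearest]
    return loss, assignments
```

Note that the selected index is an integer choice, while the loss value depends on the centroid coordinates only through the distance that was selected.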
22 votes, 2 answers

Python: String clustering with scikit-learn's dbscan, using Levenshtein distance as metric:

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find the original and the typos of each URL. I decided to use the Levenshtein distance as a similarity metric, along with DBSCAN as the clustering algorithm as…
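
For reference, the Levenshtein distance mentioned in the question above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal dynamic-programming sketch in pure Python; the function name `levenshtein` is chosen for this example, and the URLs in the usage note are made up. It does not show how to plug the metric into DBSCAN.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

A swapped pair of letters, as in `levenshtein("example.com", "exmaple.com")`, costs two edits under this definition because transpositions are not a single operation. Computing all pairwise distances for around 1 million strings is quadratic in the number of strings, which is likely to be the practical bottleneck.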
22 votes, 3 answers

Detecting object regions in image opencv

We're currently trying to detect the object regions in images of medical instruments using the methods available in OpenCV (C++ version). An example image is shown below: Here are the steps we're following: Converting the image to gray…
Maystro (reputation 2,907; badges 8, 36, 71)