Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
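As one concrete instance of the grouping task described above, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the sample points, and the starting centroids are illustrative assumptions, not part of the tag wiki; a real project would more likely use a library implementation.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical data: two visibly separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

The result depends on the starting centroids, which is why library implementations usually run several initializations and keep the best one.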

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?

6244 questions
16 votes, 2 answers

Google Maps API v3, lots of markers, clustering and performance

I have about 5000 markers I need to render on Google Map. I'm currently using the API (v3) and there are performance issues on slower machines, especially in IE. I have done the following already to help speed things up: Used a simple marker class…
16 votes, 1 answer

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater…
winwaed
16 votes, 7 answers

Clustering Lat/Longs in a Database

I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application. There are a number of resources about how to cluster, either on the client side OR in…
Pure.Krome
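One common way to cut down the number of points sent over the wire is grid bucketing: snap each coordinate to a cell and return one representative per cell. The sketch below is a hedged illustration in plain Python; the name `grid_cluster`, the cell size, and the sample coordinates are assumptions, and the same grouping could probably be expressed in SQL by grouping on the floored latitude and longitude.

```python
import math
from collections import defaultdict

def grid_cluster(points, cell_deg):
    """Bucket (lat, lng) pairs into square cells; return (lat, lng, count) per cell."""
    cells = defaultdict(list)
    for lat, lng in points:
        # Points whose floored coordinates match share a cell.
        key = (math.floor(lat / cell_deg), math.floor(lng / cell_deg))
        cells[key].append((lat, lng))
    clusters = []
    for members in cells.values():
        n = len(members)
        mean_lat = sum(p[0] for p in members) / n
        mean_lng = sum(p[1] for p in members) / n
        clusters.append((mean_lat, mean_lng, n))
    return clusters

# Hypothetical markers: three near London, two near Paris.
points = [(51.50, -0.12), (51.51, -0.13), (51.52, -0.11), (48.85, 2.35), (48.86, 2.34)]
clusters = grid_cluster(points, cell_deg=1.0)
print(clusters)
```

A fixed grid is cheap but can split a tight group that straddles a cell boundary; choosing `cell_deg` from the current zoom level is one way to keep that acceptable.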
16 votes, 2 answers

How do I manually create a dendrogram (or "hclust") object ? (in R)

I have a dendrogram given to me as images. Since it is not very large, I can construct it "by hand" into an R object. So my question is how do I manually create a dendrogram (or "hclust") object when all I have is the dendrogram image? I see that…
Tal Galili
16 votes, 2 answers

What is the state-of-the-art in unsupervised learning on temporal data?

I'm looking for an overview of the state-of-the-art methods that find temporal patterns (of arbitrary length) in temporal data and are unsupervised (no labels). In other words, given a stream/sequence of (potentially high-dimensional) data, how do…
16 votes, 2 answers

Plotting dendrogram in Scipy error for large dataset

I am using SciPy for hierarchical clustering. I do manage to get flat clusters on a threshold using fcluster. But I need to visualize the dendrogram formed. When I use the dendrogram method, it works fine for 5-6k user vectors. But my dataset…
Maxwell
15 votes, 3 answers

Equivalent of Matlab's cluster quality function?

MATLAB has a nice silhouette function to help evaluate the number of clusters for k-means. Is there an equivalent for Python's NumPy/SciPy as well?
Legend
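For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The plain-Python sketch below computes the mean over all points; the name `mean_silhouette` and the sample data are assumptions made for illustration. scikit-learn also ships `sklearn.metrics.silhouette_score`, which is likely the more practical choice.

```python
import math

def mean_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is in the sum, hence len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight pairs far apart, scored under two labelings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

To pick a number of clusters, one would run the clustering for several values of k and compare the resulting scores; values closer to 1 indicate tighter, better separated clusters.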
15 votes, 2 answers

DBSCAN with custom metric

I have the following given: a dataset in the range of thousands, and a way of computing the similarity, but I cannot plot the datapoints themselves in Euclidean space. I know that DBSCAN should support a custom distance metric, but I don't know how to…
zython
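DBSCAN only needs pairwise distances, not coordinates, so one hedged approach is to turn the similarities into a distance matrix and cluster on that. The sketch below is a plain-Python DBSCAN over a precomputed matrix; the name `dbscan_precomputed`, the `1 - similarity` conversion, and the sample matrix are assumptions. scikit-learn's `DBSCAN` accepts `metric="precomputed"` for the same purpose.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """DBSCAN over a full pairwise distance matrix; returns labels, -1 = noise."""
    n = len(dist)
    # A neighbourhood includes the point itself, as in the usual definition.
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

# Hypothetical similarity matrix with values in [0, 1]; 1 means identical.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.2, 0.0],
    [0.8, 0.9, 1.0, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.2, 1.0, 0.9, 0.0],
    [0.1, 0.2, 0.1, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
dist = [[1.0 - s for s in row] for row in sim]  # assumed conversion
labels = dbscan_precomputed(dist, eps=0.25, min_samples=2)
print(labels)
```

The full matrix costs quadratic memory, which is usually fine for thousands of items but not for much larger datasets.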
15 votes, 1 answer

initial centroids for scikit-learn kmeans clustering

If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class. This post (k-means with selected initial centers) indicates that I only need to set…
webmaker
15 votes, 2 answers

Incremental clustering algorithm for grouping news articles?

I'm doing a little research on how to cluster articles into 'news stories' à la Google News. Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words…
Peter
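A simple incremental scheme is single-pass ("leader") clustering: each new article joins the most similar existing cluster if the similarity clears a threshold, and otherwise starts its own. The sketch below uses raw word counts and cosine similarity in plain Python. The function names, the threshold, and the sample headlines are assumptions, and the word weighting mentioned in the question (for example TF-IDF) is deliberately left out.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def incremental_cluster(docs, threshold):
    """Single pass: join the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid is a running sum of counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]

# Hypothetical headlines covering two stories.
docs = [
    "storm hits coast flooding towns",
    "coast towns flooding after storm",
    "team wins championship final",
    "championship final won by home team",
]
stories = incremental_cluster(docs, threshold=0.3)
print(stories)
```

The outcome depends on arrival order and on the threshold, so both are worth tuning against a small hand-labeled sample.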
15 votes, 3 answers

How to use NLP to separate an unstructured text content into distinct paragraphs?

The following unstructured text has three distinct themes -- Stallone, Philadelphia and the American Revolution. But which algorithm or technique would you use to separate this content into distinct paragraphs? Classifiers won't work in this…
user193116
15 votes, 1 answer

overplot multiple sets of data with hexbin

I am doing some KMeans clustering on a large and really dense data set and I am trying to figure out the best way to visualize the clusters. In 2D, it looks like hexbin would do a good job but I am unable to overplot the clusters on the same…
Labibah
15 votes, 2 answers

How to identify Cluster labels in kmeans scikit learn

I am learning python scikit. The example given here displays the top occurring words in each Cluster and not Cluster name. http://scikit-learn.org/stable/auto_examples/document_clustering.html I found that the km object has "km.label" which lists…
vij555
15 votes, 4 answers

How to find cluster sizes in 2D numpy array?

My problem is the following: I have a 2D numpy array filled with 0 and 1, with an absorbing boundary condition (all the outer elements are 0), for example: [[0 0 0 0 0 0 0 0 0 0] [0 0 1 0 0 0 0 0 0 0] [0 0 1 0 1 0 0 0 1 0] [0 0 0 0 0 0 1 0 1 0] …
Cecilia
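Finding cluster sizes in a 0/1 grid is a connected-component labeling problem. The sketch below uses a breadth-first flood fill in plain Python and assumes 4-connectivity (diagonal neighbours do not count); the name `cluster_sizes` and the small grid are illustrative and are not the array from the question. With SciPy available, `scipy.ndimage.label` is probably the shorter route.

```python
from collections import deque

def cluster_sizes(grid):
    """Sizes of 4-connected groups of 1s, found with breadth-first flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Unvisited 1: start a new component and grow it outward.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

# Hypothetical grid with a zero border, as in the question's setup.
grid = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
sizes = cluster_sizes(grid)
print(sizes)
```

Sizes are reported in row-major order of each component's first cell; switching to 8-connectivity only requires adding the four diagonal offsets to the neighbour tuple.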
15 votes, 3 answers

Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I now would like to group these articles based on shared story/topic, i.e., based on their…