Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
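
As a rough illustration of the grouping idea described above, here is a minimal sketch of k-means (Lloyd's algorithm), the method several questions listed on this page ask about. It is pure Python with no external libraries. The function name `kmeans`, its parameters, the starting centroids, and the sample points are all assumptions made for this example, not part of any library API.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Cluster points (tuples of floats) with Lloyd's algorithm.

    initial_centroids is a caller-supplied list of starting centroids
    (an assumption of this sketch; real libraries usually pick them for you).
    """
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Two visually obvious groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Points in the same cluster end up close to a shared centroid, while the two centroids sit far apart, which is the "similar within, dissimilar between" property in its simplest form. The result depends on the starting centroids, so a different choice can give different clusters on less clear-cut data.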

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
15 votes, 3 answers

Efficient k-means evaluation with silhouette score in sklearn

I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to…
moustachio
15 votes, 4 answers

Trajectory Clustering: Which Clustering Method?

As a newbie in Machine Learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise. In addition, not all of them are…
Sibbs Gambling
15 votes, 8 answers

Efficient way of calculating likeness scores of strings when sample size is large?

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list. I'm aware of how…
matt b
14 votes, 2 answers

R: How to overlay pie charts on 'dots' in a scatterplot in R

Using R, I would like to replace the points in a 2D scatter plot with pie charts displaying additional values. The rationale is that I have time series data for hundreds of elements (proteins) derived from a biological experiment monitored for 4…
philipp
14 votes, 1 answer

How to specify distance metric for kmeans in R?

I'm doing kmeans clustering in R with two requirements: I need to specify my own distance function (currently the Pearson coefficient), and I want the clustering to use the average of group members as centroids, rather than some actual member. The reason for…
Derrick Zhang
14 votes, 5 answers

Java machine learning library for commercial use?

Does anyone know a good Java machine learning library I can use for a commercial product? Weka and Rapidminer unfortunately do not allow this. I already found Apache Mahout and Java Data Mining Package. Has anyone experience with them and provide…
WorstCase
14 votes, 5 answers

Graph Theory: Calculating Clustering Coefficient

I'm doing some research and I've come to a point where I have to calculate the clustering coefficient of a graph. According to this paper directly related to my research: The clustering coefficient C(p) is defined as follows. Suppose that a vertex v…
Griffin
14 votes, 6 answers

How do I create a radial cluster like the following code-example in Python?

I've found several examples on how to create these exact hierarchies (at least I believe they are) like the following here stackoverflow.com/questions/2982929/ which work great, and almost perform what I'm looking for. [EDIT]Here's a simplified…
T Carrasco
14 votes, 2 answers

Interest and location based algorithm for android mobile app

I am trying to work on an Android mobile app where I have functionality to find matches according to interest and location. Many dating apps already offer this kind of functionality; for example, Tinder matches based on location, gender and age…
N Sharma
14 votes, 2 answers

Image clustering by its similarity in python

I have a collection of photos and I'd like to distinguish clusters of the similar photos. Which features of an image and which algorithm should I use to solve my task?
alex
14 votes, 1 answer

How to Bound the Outer Area of Voronoi Polygons and Intersect with Map Data

Background I'm trying to visualize the results of a kmeans clustering procedure on the following data using voronoi polygons on a US map. Here is the code I've been running so far: input <- read.csv("LatLong.csv", header = T, sep = ",") # K Means…
Rick Arko
14 votes, 4 answers

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set. I could use the function score() to compute the log probability under the model. However, I am looking for a metric called 'purity' which is defined…
Kuka
14 votes, 3 answers

What is a convenient way to do document clustering with elasticsearch?

I have stored a lot of news articles from RSS feeds from different sources in an elasticsearch index. At the moment, when I run a search query, it returns a lot of similar news articles for one query, because the same news topics get covered…
asmaier
14 votes, 1 answer

Approaches for spatial geodesic latitude longitude clustering in R with geodesic or great circle distances

I would like to apply some basic clustering techniques to some latitude and longitude coordinates. Something along the lines of clustering (or some unsupervised learning) the coordinates into groups determined either by their great circle distance…
JasonAizkalns
14 votes, 3 answers

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal. There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using…