Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and it is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
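As a rough illustration of what "grouping objects so that members of a cluster are similar" means in practice, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The choice of k-means, the function name `kmeans`, the sample points, and the starting centroids are all assumptions made for this example. It is not a reference implementation of any library.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on tuples of floats; `centroids` are the starting centers."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical sample data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The starting centroids are passed in explicitly so the run is deterministic. Real implementations usually pick them randomly or with a seeding heuristic, so results can differ between runs.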

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
8 votes, 5 answers

Algorithm for clustering pictures based on date taken

Does anyone know of an algorithm that will group pictures into events based on the date each picture was taken? Obviously I can group by the date, but I'd like something a little more sophisticated that would (might) be able to group pictures spanning…
Greg Dean
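One simple approach to this kind of problem, sketched here under stated assumptions, is to sort the timestamps and start a new event whenever the gap to the previous picture exceeds a threshold. The function name `group_by_gap`, the 6-hour threshold, and the sample timestamps are hypothetical choices for illustration. This is not taken from the question or its answers.

```python
from datetime import datetime, timedelta

def group_by_gap(timestamps, max_gap=timedelta(hours=6)):
    """Split sorted timestamps into events wherever the gap exceeds max_gap."""
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][-1] <= max_gap:
            events[-1].append(ts)   # close enough to the previous photo: same event
        else:
            events.append([ts])     # first photo, or gap too large: new event
    return events

# Hypothetical capture times: one outing on July 1 and another on July 4.
photos = [
    datetime(2023, 7, 4, 9, 5),
    datetime(2023, 7, 1, 10, 0),
    datetime(2023, 7, 1, 13, 0),
    datetime(2023, 7, 1, 10, 30),
    datetime(2023, 7, 4, 9, 0),
]
events = group_by_gap(photos)
print([len(e) for e in events])
```

Because events are chained photo to photo, a long trip with regular shooting stays one event even if it spans several days. The threshold is the main tuning knob.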
8 votes, 4 answers

WEKA K-Means Clustering

Can anybody explain what the output of the K-Means clustering in WEKA actually means? For example: kMeans Number of iterations: 9 Within cluster sum of squared errors: 9434.911100488926 Missing values globally replaced with mean/mode Cluster…
Chris Taylor
8 votes, 3 answers

Clustering images using unsupervised Machine Learning

I have a database of images that contains identity cards, bills and passports. I want to classify these images into different groups (i.e., identity cards, bills and passports). From what I have read, one of the ways to do this task is clustering…
8 votes, 1 answer

What are noisy samples in Scikit's DBSCAN clustering algorithm?

If I apply Scikit's DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on a similarity matrix, I get a series of labels back. Some of these labels are -1. The documentation calls them noisy samples. What are…
Auxiliary
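To make the -1 label concrete, here is a minimal DBSCAN sketch in plain Python. It is a simplified illustration, not scikit-learn's implementation, and it works on coordinates rather than the precomputed matrix mentioned in the question. The function name `dbscan` and the sample points are assumptions. In this formulation a point ends up as noise when it is neither a core point nor within `eps` of one.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN sketch: returns one label per point, -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual formulation.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster
                    if is_core[j]:
                        frontier.append(j)  # only core points keep expanding
        cluster += 1
    return labels

# Hypothetical data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

The isolated point has no neighbors within `eps` other than itself, so it never joins a cluster and keeps the label -1.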
8 votes, 4 answers

How to plot a k-distance graph in Python

How do I plot (in Python) the distance graph for a given value of min-points in DBSCAN? I am looking for the knee and the corresponding epsilon value. In sklearn I do not see any method that returns such distances. Am I missing something?
Mauro Gentile
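The values behind a k-distance graph can be computed directly: for each point, take the distance to its k-th nearest neighbor, then sort those distances. The sketch below does this by brute force in plain Python. The function name `k_distances` and the sample points are assumptions, and here k counts neighbors excluding the point itself, which may differ by one from how a given library counts min-points.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: four corners of a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
curve = k_distances(points, k=2)
print(curve)
```

Plotting `curve` against its index with any plotting library gives the k-distance graph. The knee is where the sorted distances start rising sharply, which in this toy data is the jump at the last point.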
8 votes, 3 answers

Clustering a list of words in Python

I am a newbie in text mining; here is my situation. Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle']. I would like to cluster the words into k groups, and I want the output to be [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate…
Kevin Lee
8 votes, 2 answers

Scikit-learn, KMeans: How to use max_iter

I'd like to understand the parameter max_iter from the class sklearn.cluster.KMeans. According to the documentation: max_iter : int, default: 300 Maximum number of iterations of the k-means algorithm for a single run. But in my opinion if I have…
C-Jay
8 votes, 2 answers

Permuting the rows and columns of a matrix for clustering

I have a distance matrix that is 1000x1000 in dimension and symmetric with 0s along the diagonal. I want to form groupings of distances (clusters) by simultaneously reordering the rows and columns of the matrix. This is like reordering a matrix…
user439463
8 votes, 2 answers

OpenCV Euclidean clustering vs findContours

I have the following image mask: I want to apply something similar to cv::findContours, but that algorithm only joins connected points in the same groups. I want to do this with some tolerance, i.e., I want to add the pixels near each other within…
manatttta
8 votes, 1 answer

In WildFly 9, how to set up stateful EJB session replication with two nodes in standalone mode (clustering)

I want to set up clustering for an EAR project. I found one solution for running standalone in clustering mode using the standalone-ha.xml configuration. I followed the article "Clustering in domain mode with wildfly9" and it works fine. But I want to run ERP project…
8 votes, 2 answers

Choosing the number of clusters in hierarchical agglomerative clustering with scikit

The Wikipedia article on determining the number of clusters in a dataset indicated that I do not need to worry about such a problem when using hierarchical clustering. However, when I tried to use scikit-learn's agglomerative clustering I see that I…
8 votes, 3 answers

How to find local maxima in kernel density estimation?

I'm trying to make a filter (to remove outliers and noise) using kernel density estimators (KDE). I applied KDE to my 3D (d=3) data points, and that gives me the probability density function (PDF) f(x). Now as we know local maxima of density estimation…
jquery404
8 votes, 1 answer

Affinity Propagation (sklearn) - strange behavior

Trying to use affinity propagation for a simple clustering task: from sklearn.cluster import AffinityPropagation c = [[0], [0], [0], [0], [0], [0], [0], [0]] af = AffinityPropagation (affinity = 'euclidean').fit (c) print (af.labels_) I get this…
Baba
8 votes, 2 answers

Clustering categorical data using Jaccard similarity

I am trying to build a clustering algorithm for categorical data. I have read about different algorithms such as k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training set and…
Sam
8 votes, 1 answer

clusplot - showing variables

I would like to add to a clusplot plot the variables used for PCA, drawn as arrows. I am not sure whether this has been implemented (I can't find anything in the documentation). I have produced a clusplot that looks like this: With the package princomp I…
Dario Lacan