Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
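As a rough illustration of grouping unlabeled points so that members of a cluster are close to each other, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. k-means is chosen only as a familiar example; the function name `kmeans`, its parameters, and the toy data are assumptions made for illustration, not part of any library.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Illustrative k-means (Lloyd's algorithm) on a list of coordinate tuples.

    Returns (labels, centroids). The name and signature are hypothetical.
    """
    rng = random.Random(seed)
    # Start from k distinct input points as initial centroids.
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Toy data: two well-separated groups of 2-D points (made up for this sketch).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary; only which points share an index is meaningful. For real workloads an established library implementation is likely a better choice than a hand-rolled loop like this one.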

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?

6244 questions
22 votes, 6 answers

scikit-learn: Finding the features that contribute to each KMeans cluster

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has for each of the clusters? What I want to be able to say is that for cluster k1, features 1, 4, 6 were the primary…
cmgerber (reputation 2,199; badges 3, 16, 15)

22 votes, 2 answers

scikit-learn: clustering text documents using DBSCAN

I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm.…

22 votes, 7 answers

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification. Suppose I have done a simple k-means clustering: I have a lot of data, I run k-means on it, and I get 2 clusters, A and B. And the centroid calculating method is…
Sirius Wang (reputation 339; badges 1, 5, 15)

22 votes, 2 answers

DBSCAN in scikit-learn of Python: save the cluster points in an array

Following the example Demo of DBSCAN clustering algorithm from scikit-learn, I am trying to store the x, y of each clustering class in an array: import numpy as np from sklearn.cluster import DBSCAN from sklearn import metrics from…
Gianni Spear (reputation 7,033; badges 22, 82, 131)

22 votes, 8 answers

Map Clustering Algorithm

My current code is pretty quick, but I need to make it even faster so we can accommodate even more markers. Any suggestions? Notes: The code runs fastest when the SQL statement is ordered by marker name - which itself does a very partial job of…
Chris B (reputation 15,524; badges 5, 33, 40)

21 votes, 7 answers

Multidimensional Euclidean Distance in Python

I want to calculate the Euclidean distance in multiple dimensions (24 dimensions) between 2 arrays. I'm using numpy-Scipy. Here is my code: import numpy,scipy; A=numpy.array([116.629, 7192.6, 4535.66, 279714, 176404, 443608, 295522, 1.18399e+07,…
garak (reputation 4,713; badges 9, 39, 56)

21 votes, 5 answers

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data length. In particular, I'm not sure how to update…
Jaz (reputation 581; badges 2, 6, 10)

21 votes, 8 answers

Java Clustering Library

I am looking for a lightweight clustering library in Java. I don't need hundreds of clustering algorithms in that library; just 5 to 7 would be fine for me. I am sure you are going to ask: "what kind of algo do you need and for what purpose" :). I just…
user238384 (reputation 2,396; badges 10, 35, 36)

21 votes, 3 answers

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity but I found that the results are…
Jack Twain (reputation 6,273; badges 15, 67, 107)

21 votes, 1 answer

How can I fix a MemoryError when executing scikit-learns silhouette score?

I run a clustering algorithm and want to evaluate the result using the silhouette score in scikit-learn. But scikit-learn needs to calculate the distance matrix: distances = pairwise_distances(X, metric=metric, **kwds) Due to the fact that…

20 votes, 1 answer

How to get Agglomerative Clustering "Centroid" in python Scikit-learn

This is the code I am using for silhouette_score, with Agglomerative Clustering and Ward linkage. I would like to get the "Centroid" of Agglomerative Clustering; is that possible? I could only get…
Pandalove (reputation 201; badges 1, 2, 3)

20 votes, 6 answers

Grouping similar news contents together like in GOOGLE NEWS

I am unable to manage the RSS feeds easily due to an overwhelming number of new stories / similar news contents posted in various news sites. For subjects such as world news and business news, many of the stories are redundant, adding a burden to…
Gourav (reputation 209; badges 2, 3)

20 votes, 1 answer

Clustering cosine similarity matrix

A few questions on Stack Overflow mention this problem, but I haven't found a concrete solution. I have a square matrix which consists of cosine similarities (values between 0 and 1), for example: | A | B | C | D A | 1.0 | 0.1 | 0.6 | 0.4 B…
Stefan D (reputation 1,229; badges 2, 15, 29)

20 votes, 1 answer

How to use 'hclust' as function call in R

I tried to construct the clustering method as a function in the following way: mydata <- mtcars # Here I construct hclust as a function hclustfunc <- function(x) hclust(as.matrix(x),method="complete") # Define distance metric distfunc <- function(x)…
neversaint (reputation 60,904; badges 137, 310, 477)

19 votes, 4 answers

Best clustering algorithm? (simply explained)

Imagine the following problem: You have a database containing about 20,000 texts in a table called "articles" You want to connect the related ones using a clustering algorithm in order to display related articles together The algorithm should do…
caw (reputation 30,999; badges 61, 181, 291)