Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
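
As a rough illustration of grouping unlabeled data, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. k-means is chosen here only as a familiar example; the function name `kmeans`, the sample points, and the deterministic "first k points" initialization are all assumptions made for the sake of a short, reproducible demo, not a recommended implementation.

```python
def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns one cluster index per input point."""
    # Simplifying assumption: use the first k points as initial centers.
    # Real implementations usually use random or k-means++ initialization.
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Two visibly separated groups of 2-D points, with no labels supplied.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The points near the origin end up in one cluster and the points near (5, 5) in the other, which is the "similar within, dissimilar across" property described above. For real work, a library implementation is likely a better starting point than this sketch.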

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
12 votes, 1 answer

SciPy's sparse eigsh() for small eigenvalues

I'm trying to write a spectral clustering algorithm using NumPy/SciPy for larger (but still tractable) systems, making use of SciPy's sparse linear algebra library. Unfortunately, I'm running into stability issues with eigsh(). Here's my…
Magsol
11 votes, 1 answer

Bisecting k-means clustering algorithm explanation

I was required to write a bisecting k-means algorithm, but I didn't understand the algorithm. I know the k-means algorithm. Can you explain the algorithm, but not in academic language? Thanks.
Nir
11 votes, 8 answers

ALGORITHM - String similarity score/hash

Is there a method to calculate something like a general "similarity score" of a string? That is, I am not comparing two strings directly; rather, I get some number/score (hash) for each string that can later tell me that two strings are or are…
Ajay
11 votes, 3 answers

Affinity Propagation preferences initialization

I need to perform clustering without knowing the number of clusters in advance. The number of clusters may be from 1 to 5, since I may find cases where all the samples belong to the same instance, or to a limited number of groups. I thought affinity…
11 votes, 1 answer

DIvisive ANAlysis (DIANA) Hierarchical Clustering

(This post is a continuation of my previous question on the divisive hierarchical clustering algorithm.) The problem is how to implement this algorithm in Python (or any other language). Algorithm description: A divisive clustering proceeds by a series of…
Andrej
11 votes, 4 answers

Python: DBSCAN in 3 dimensional space

I have been searching around for an implementation of DBSCAN for 3-dimensional points without much luck. Does anyone know a library that handles this, or have any experience with doing this? I am assuming that the DBSCAN algorithm can handle 3…
user2909415
11 votes, 1 answer

PCA multiplot in R

I have a dataset that looks like this: India China Brasil Russia SAfrica Kenya States Indonesia States Argentina Chile Netherlands HongKong 0.0854026763 0.1389383234 0.1244184371 0.0525460881 0.2945586244 0.0404562539 …
Angelo
11 votes, 4 answers

Clustering image segments in OpenCV

I am working on motion detection with a non-static camera using OpenCV. I am using a pretty basic background subtraction and thresholding approach to get a broad sense of all that's moving in a sample video. After thresholding, I enlist all separable…
Ekansh Gupta
11 votes, 6 answers

Clustered Graphs Visualization Techniques

I need to visualize a relatively large graph (6K nodes, 8K edges) that has the following properties: Distinct Clusters. Approximately 50-100 Nodes per cluster and moderate interconnectivity at the cluster level Minimal (5-10 inter-cluster edges per…
jameszhao00
11 votes, 3 answers

Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Here's my scenario. Consider a set of events that happen at various places and times - as an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightning strikes are instantaneous and can only hit…
wishihadabettername
11 votes, 2 answers

Estimating/Choosing optimal Hyperparameters for DBSCAN

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (like agentive, instrumental, time, place etc.). I tried using k-means clustering, but it was of little help; it didn't work well, there was a lot of…
Riyaz
11 votes, 2 answers

How can I cluster documents using k-means (FLANN with Python)?

I want to cluster documents based on similarity. I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster and that FLANN is the fastest of all implementations and more accurate, so I am trying FLANN with Python bindings but…
Phyo Arkar Lwin
11 votes, 3 answers

Find groups with high cross correlation matrix in Matlab

Given a lower triangular matrix (100x100) containing cross-correlation values, where entry 'ij' is the correlation value between signals 'i' and 'j', so that a high value means that these two signals belong to the same class of objects, and knowing there…
user1641496
10 votes, 1 answer

NetworkX graph clustering

In NetworkX, how can I cluster nodes based on node color? E.g., I have 100 nodes, some of them are close to black, while others are close to white. In the graph layout, I want nodes with similar colors to stay close to each other, and nodes with very…
Geni
10 votes, 1 answer

R Clustering 'purity' metric

I am using the fpc package in R to perform cluster validation. I could use the function cluster.stats() to compare my clustering with an external partitioning and compute several metrics like the Rand index, entropy, etc. However, I am looking for a…
chet