Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and it is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering, and hierarchical clustering.

Related topics: classification, pattern recognition, knowledge discovery, taxonomy. Not to be confused with cluster computing.
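
As a concrete illustration of the definition above, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, its parameters, and the toy points are assumptions made for illustration and are not part of the tag wiki; real projects would normally rely on a library implementation instead.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Hypothetical helper: cluster `points` (numeric tuples) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(iters):
        # Assignment step: each point gets the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Two well-separated toy groups (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points in the same group receive the same label; which group is numbered 0 and which is numbered 1 depends on the random initialisation.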

NOTE: If your question does not directly concern implementation, consider posting it on Cross Validated, Data Science, or Artificial Intelligence instead; such questions are probably off-topic here. Please choose one site only and do not cross-post. See "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
2
votes
2 answers

Non-density-based data clustering algorithm

I'm working on a cluster analysis program that takes a set of points S as input and labels each point with the index of the cluster it belongs to. I've implemented the DBSCAN and OPTICS algorithms and they both work as expected. However, the…
dotminic
2
votes
1 answer

Effect of mat2gray on multithresh

I do not get why a segmentation obtained by using multithresh on an "original" double image is different from a segmentation using the same parameters on the same image scaled by mat2gray. E.g.: testimage = randi(100,[200…
2
votes
2 answers

revealing clusters of interaction in igraph

I have an interaction network and I used the following code to make an adjacency matrix and subsequently calculate the dissimilarity between the nodes of the network and then cluster them to form…
johnny utah
2
votes
3 answers

DBSCAN vs OPTICS for Automatic Clustering

I know that DBSCAN requires two parameters (minPts and Eps). However, I am confused about which parameters OPTICS needs, because some sources say it requires eps while others say it only requires minPts. Which algorithm would be the better to…
user3315340
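
For reference, the two DBSCAN parameters named in the excerpt above can be seen in a minimal plain-Python sketch. The function name `dbscan`, the `NOISE` constant, and the sample points are assumptions made for illustration; this is not the asker's code, and it does not cover OPTICS.

```python
from math import dist  # Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Hypothetical helper: return a cluster index per point, or NOISE."""

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is also core, so expand through it
                queue.extend(nbrs)
    return labels


# Two dense groups and one isolated point (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

In this sketch `eps` is the neighbourhood radius and `min_pts` is the number of points (counting the point itself) that must fall inside that radius for a point to be treated as a core point.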
2
votes
1 answer

XMeans ELKI fails at every third input file

I'm trying to cluster image data (stored in 100 separate csv files) with ELKI's XMeans algorithm. It works well for the first two files, but then the algorithm hangs on forever while processing the third file. It looks like the problem occurs at…
Charlie28000
2
votes
2 answers

How to use WeightedCluster::wcKMedoids to provide clustering for heatmap or heatmap.2 in R?

TL;DR: How to use the WeightedCluster library (the wcKMedoids() method in particular) as input to heatmap, heatmap.2 or similar, to provide it with clustering info? We are creating a heatmap from some binary data (yes/no values, represented as ones…
Samuel Lampa
2
votes
1 answer

Choose the number of clusters and vertices in python igraph

I have a complete weighted graph as you can see in the image below: The Goal: My goal is to be able to choose the number of clusters and the number of vertices in each cluster using python's implementation of iGraph What I've Tried So Far: import…
jackzellweger
2
votes
1 answer

Text clustering using arbitrary metrics with sklearn kmeans

I'm running text clustering on a table that contains medical terms. I want to cluster strings that have similar words: if two strings have two or more words in common, they should be more likely to end up in one cluster than if they only have one word in common. I…
Lelo
2
votes
0 answers

K-modes clustering in R for categorical data with NAs

dat <- data.frame(x=sample(letters[1:3],20,TRUE),y=sample(LETTERS[7:9],20,TRUE),stringsAsFactors=FALSE) dat[c(1:5,9,17,20),1] <- NA;dat[c(8,11),2] <- NA dat x y 1 H 2 I 3 G 4 H 5 I 6 c H 7…
Roy C
2
votes
1 answer

Are there advantages of using sklearn KMeans versus SciPy kmeans?

From the documentation of sklearn KMeans class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1) and SciPy…
pepe
2
votes
1 answer

Determining effects of clustering

In clustering, what effects do noisy, redundant, and irrelevant attributes have? Do they end up helping or hurting clustering? I know that it is unable to handle noisy data, but I am not sure about the other two.
chris551
2
votes
2 answers

Is this the expected behavior of the DBSCAN algorithm (two identical data samples not fitting in the same cluster)?

Please forgive the lack of formal terms, I've only recently approached ML. For learning purposes, I decided to try a Ruby implementation of the DBSCAN algorithm (https://github.com/matiasinsaurralde/dbscan). Building on the simple example at…
Redoman
2
votes
0 answers

Why are the results different in hclust and heatmap.2 using the same clustering functions?

I'm trying to understand my data a bit better by doing some cluster analysis. Using the same data, I first ran hclust with this code: # Dissimilarity matrix df <- scale(m.sel) d <- dist(df, method = "euclidean") # Hierarchical clustering using…
2
votes
1 answer

String clustering in Python

I have a list of strings and I want to classify it by using clustering in Python. list = ['String1', 'String2', 'String3',...] I want to use Levenshtein distance, so I used jellyfish library. Given two strings, I know that their distance can be…
Muny
2
votes
1 answer

Computation of clusters

I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the methods of Scikit Learn Clustering one after the other, below is how long they take on ~ 50,000 files with 26…
patrick