Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.
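As a rough illustration of grouping unlabeled points so that nearby points share a cluster, here is a minimal sketch using plain k-means (Lloyd's algorithm), chosen only for brevity. The function name `kmeans`, the sample points, and the choice to seed centroids with the first k points are assumptions made for this example, not part of the tag description.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical unlabeled data: two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

Real data usually needs a better initialization and a way to choose k; library implementations handle both.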

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
19 votes, 4 answers

How to get flat clustering corresponding to color clusters in the dendrogram created by scipy

Using the code posted here, I created a nice hierarchical clustering: Let's say the dendrogram on the left was created by doing something like Y = sch.linkage(D, method='average') # D is a distance matrix cutoff = 0.5*max(Y[:,2]) Z =…
conradlee
19 votes, 3 answers

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'q')

I am trying to apply a Gower distance implementation to my data frame. While it worked smoothly on the same dataset with more features, this time it gives an error when I call the Gower distance function. I import the Gower function from…
Beg
19 votes, 1 answer

How to add k-means predicted clusters in a column to a dataframe in Python

I have a question about k-means clustering in Python. I did the analysis this way: from sklearn.cluster import KMeans km = KMeans(n_clusters=12, random_state=1) new =…
Keithx
19 votes, 5 answers

How to calculate BIC for k-means clustering in R

I've been using k-means to cluster my data in R, but I'd like to be able to assess the fit vs. model complexity of my clustering using the Bayesian Information Criterion (BIC) and AIC. Currently the code I've been using in R is: KClData <- kmeans(Data,…
UnivStudent
17 votes, 5 answers

How would you group/cluster these three areas in arrays in python?

So you have an array 1 2 3 60 70 80 100 220 230 250 For a better understanding: How would you group/cluster the three areas in arrays in Python (v2.6), so you get three arrays in this case containing [1 2 3] [60 70 80 100] [220 230…
Zurechtweiser
17 votes, 3 answers

Algorithm for fitting objects in a space

I have a collection of different sized squares and rectangles that I want to fit together using PHP into one large square/rectangle. The squares are usually images that I want to make into a montage - but sometimes they are simply math objects. Are…
Xeoncross
17 votes, 4 answers

Can I use K-means algorithm on a string?

I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves…
Doni
17 votes, 2 answers

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a Lisp-like S-expression or a (generalized) algebraic data type in Haskell…
17 votes, 4 answers

Scikit K-means clustering performance measure

I'm trying to do clustering with the K-means method, but I would like to measure the performance of my clustering. I'm not an expert, but I am eager to learn more about clustering. Here is my code: import pandas as pd from sklearn import…
17 votes, 1 answer

How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to input into a k-means algorithm (which I will implement). In that algorithm I will have to compute distances between centroids…
gsamaras
17 votes, 2 answers

Cosine similarity when one of the vectors is all zeros

How to express the cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity ) when one of the vectors is all zeros? v1 = [1, 1, 1, 1, 1] v2 = [0, 0, 0, 0, 0] When we calculate according to the classic formula we get division by zero: Let…
17 votes, 5 answers

How do I predict new data's cluster after clustering training data?

I have already trained my clustering model using hclust: model=hclust(distances,method="ward") And the result looks good: Now I have some new data records, and I want to predict which cluster each of them belongs to. How do I do this?
WoooHaaaa
17 votes, 5 answers

What does the Brown clustering algorithm output mean?

I've run the brown-clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. They both give some sort of binary and another integer for each unique token.…
alvas
17 votes, 2 answers

dbscan - setting limit on maximum cluster span

As I understand DBSCAN, you can specify an epsilon of, say, 100 meters and, because DBSCAN considers density-reachability rather than direct density-reachability when finding clusters, end up with a cluster in which…
user139014
16 votes, 8 answers

How can I cluster a graph in Python?

Let G be a graph. So G is a set of nodes and a set of links. I need to find a fast way to partition the graph. The graph I am now working with has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine,…
Pietro Speroni