Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into "clusters" and analyzing the results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, and correlation clustering.
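
To make "grouping unlabeled data" concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration only, not a reference implementation: the function names, the toy `points`, and the deterministic farthest-point seeding are all assumptions made to keep the example small and reproducible. Real libraries usually seed randomly or with k-means++.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    # Seeding (assumption for determinism): start with the first point,
    # then repeatedly add the point farthest from the chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy data (hypothetical): two well-separated blobs, no labels given.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On this toy input the first three points end up in one cluster and the last three in the other. The number of clusters `k` has to be chosen up front, which is one of the recurring questions under this tag.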

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
9 votes, 4 answers

Check if one regex covers another regex

I'm attempting to implement a text clustering algorithm. The algorithm clusters similar lines of raw text by replacing them with regexes, and aggregates the number of patterns matching each regex so as to provide a neat summary of the input text…
Kowshik
  • 1,541
  • 3
  • 17
  • 25
9 votes, 4 answers

given 10 functions y=a+bx and 1000's of (x,y) data points rounded to ints, how to derive 10 best (a,b) tuples?

We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this: fee =…
Justin Grant
  • 44,807
  • 15
  • 124
  • 208
9 votes, 1 answer

clustering and matlab

I'm trying to cluster some data from the KDD Cup 1999 dataset. The output from the file looks like…
G Gr
  • 6,030
  • 20
  • 91
  • 184
9 votes, 3 answers

How to cluster an instance with Weka's DBSCAN?

I've been trying to use the DBSCAN clusterer from Weka to cluster instances. From what I understand I should be using the clusterInstance() method for this, but to my surprise, when taking a look at the code of that method, it looks like the…
Oak
  • 26,231
  • 8
  • 93
  • 152
9 votes, 2 answers

How to pick the T1 and T2 threshold values for Canopy Clustering?

I am trying to implement the Canopy clustering algorithm along with K-Means. I've done some searching online that says to use Canopy clustering to get your initial starting points to feed into K-Means. The problem is that in Canopy clustering, you need…
Jonathan
  • 111
  • 1
  • 3
9 votes, 2 answers

Clustering geospatial data on coordinates AND non spatial feature

Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like: business_lat business_lng business_rating 0 19.111841 72.910729 5. 1 19.111342 72.908387 5. 2 …
9 votes, 7 answers

How to compute precision and recall in clustering?

I am really confused about how to compute precision and recall in clustering applications. I have the following situation: Given two sets A and B. By using a unique key for each element I can determine which of the elements of A and B match. I want to…
Christian Stade-Schuldt
  • 4,671
  • 7
  • 35
  • 30
9 votes, 2 answers

Java text clustering library

Which of the data mining Java libraries can do text clustering?
bme
  • 516
  • 5
  • 13
9 votes, 5 answers

Order of rows in heatmap?

Take the following code: heatmap(data.matrix(signals),col=colors,breaks=breaks,scale="none",Colv=NA,labRow=NA) How can I extract, pre-calculate or re-calculate the order of the rows in the heatmap produced? Is there a way to inject the output of…
Ron Gejman
  • 6,135
  • 3
  • 25
  • 34
9 votes, 3 answers

how do I cluster a list of geographic points by distance?

I have a list of points P=[p1,...pN] where pi=(latitudeI,longitudeI). Using Python 3, I would like to find a smallest set of clusters (disjoint subsets of P) such that every member of a cluster is within 20km of every other member in the…
Lars Ericson
  • 1,952
  • 4
  • 32
  • 45
9 votes, 1 answer

How to perform clustering on Word2Vec

I have a semi-structured dataset, each row pertains to a single user: id, skills 0,"java, python, sql" 1,"java, python, spark, html" 2, "business management, communication" It is semi-structured because the following skills can only be selected…
Ivan
  • 673
  • 2
  • 8
  • 20
9 votes, 1 answer

Why is Adjusted rand index(ARI) better than rand index(RI) and how to understand ARI intuitively from the formula

I read the Wikipedia article about Rand Index and Adjusted Rand Index. I can understand how they are calculated mathematically and can interpret Rand index as the ratio of agreements over disagreements. But I am failing to get the same intuition about…
RTM
  • 759
  • 2
  • 9
  • 22
9 votes, 3 answers

Rand Index function (clustering performance evaluation)

As far as I know, there is no package available for Rand Index in Python, while for Adjusted Rand Index you have the option of using sklearn.metrics.adjusted_rand_score(labels_true, labels_pred). I wrote the code for Rand Score and I am going to…
Hadij
  • 3,661
  • 5
  • 26
  • 48
9 votes, 1 answer

Infomap community detection understanding

I need an understandable description of the Infomap community detection algorithm. I read the papers, but they were not clear to me. My questions: How does the algorithm basically work? What do random walks have to do with it? What is the map equation and…
Sully
  • 169
  • 3
  • 13
9 votes, 3 answers

How to perform cluster with weights/density in python? Something like kmeans with weights?

My data is like this: powerplantname, latitude, longitude, powergenerated A, -92.3232, 100.99, 50 B, , , 10 C, , , 20 D, , , 40 E, , , 5 I want to be able to cluster the data into N clusters…
Rolando
  • 58,640
  • 98
  • 266
  • 407