Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In machine learning and data mining, clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering, and hierarchical clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If your question does not directly concern implementation, it is probably off-topic here; consider posting on Cross Validated, Data Science, or Artificial Intelligence instead. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
10 votes, 2 answers

C/C++ Machine Learning Libraries for Clustering

What are some C/C++ machine learning libraries that support clustering of multidimensional data (for example, k-means)? So far I have come across SGI MLC++ (http://www.sgi.com/tech/mlc/) and OpenCV MLL. I am tempted to roll my own, but I am sure…
The Unknown
10 votes, 3 answers

Server-side clustering for Google Maps API v3

I am currently developing a kind of Google Maps overview widget that displays locations as markers on the map. The number of markers varies from several hundred to thousands (10,000 and up). Right now I am using MarkerClusterer for Google…
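One common way to move marker clustering to the server is to bucket coordinates into grid cells and send only one aggregate per cell to the client. The sketch below illustrates that idea in Python; it is not taken from the question or its answers, and `grid_cluster`, `cell_size_deg`, and the sample coordinates are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

def grid_cluster(points, cell_size_deg):
    """Group (lat, lng) points into square grid cells and return one
    (centroid_lat, centroid_lng, count) tuple per non-empty cell."""
    cells = defaultdict(list)
    for lat, lng in points:
        # Points whose coordinates fall into the same cell share a key
        key = (int(lat // cell_size_deg), int(lng // cell_size_deg))
        cells[key].append((lat, lng))
    clusters = []
    for members in cells.values():
        n = len(members)
        clusters.append((sum(p[0] for p in members) / n,
                         sum(p[1] for p in members) / n,
                         n))
    return clusters

# Hypothetical sample data: two nearby markers and one far away
markers = [(52.52, 13.40), (52.53, 13.41), (48.85, 2.35)]
clusters = grid_cluster(markers, cell_size_deg=1.0)
```

In practice the cell size would probably be derived from the current zoom level so that clusters split as the user zooms in; that mapping is left out here.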
10 votes, 3 answers

k-means clustering implementation in JavaScript?

I need a JavaScript implementation of the k-means clustering algorithm. I only have 1-dimensional data and rarely more than 100 items, so performance is not an issue. PS: I could only find one, but it seems extremely unstable, resulting in…
stephanos
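For 1-dimensional data of this size, k-means (Lloyd's algorithm) fits in a few lines. The sketch below is in Python rather than JavaScript and only shows the logic; `kmeans_1d` is a hypothetical name, and the evenly spaced initial centers are an assumption made here to keep results repeatable between runs, not something the question prescribes.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a list of numbers; returns (centers, labels)."""
    data = sorted(values)
    # Deterministic start: take k evenly spaced elements of the sorted data
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in values]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
```

Because the starting centers depend only on the sorted input, repeated calls on the same data give the same clustering, which avoids one usual source of run-to-run variation in k-means.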
10 votes, 2 answers

Mixed variables (categorical and numerical) distance function

I want to fuzzy cluster a set of jobs. The job attributes are: Categorical: position, diploma, skills. Numerical: salary, years of experience. My question is: how do I calculate the distance between different jobs? e.g…
Mariya
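One widely used option for mixed categorical and numerical attributes is a Gower-style distance, which averages a per-attribute dissimilarity scaled to [0, 1]. The sketch below is an illustration of that technique, not the method from the question's answers; the job records, the attribute ranges, and the name `gower_distance` are made up for the example, and the multi-valued `skills` attribute is left out (a set measure such as Jaccard would be one way to handle it).

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity in [0, 1] between two records (dicts).
    Numeric attributes: |a - b| / range. Categorical: 0 if equal, else 1."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key]
            # A zero range means the attribute is constant, so it contributes 0
            total += abs(a[key] - b[key]) / span if span else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical records and ranges (max - min over the whole data set)
job1 = {"position": "engineer", "diploma": "MSc", "salary": 50000, "years": 5}
job2 = {"position": "engineer", "diploma": "BSc", "salary": 60000, "years": 10}
ranges = {"salary": 40000, "years": 20}
d = gower_distance(job1, job2, ranges)
```

Scaling each numeric attribute by its range keeps salary from dominating years of experience simply because its raw values are larger.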
10 votes, 4 answers

Clustering using Latent Dirichlet Allocation algo in gensim

Is it possible to do clustering in gensim for a given set of inputs using LDA? How can I go about it?
Sharmila
10 votes, 4 answers

K-means with really large matrix

I have to perform k-means clustering on a really huge matrix (about 300,000 × 100,000 values, which is more than 100 GB). I want to know whether I can use R or Weka for this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB…
Delphine
10 votes, 2 answers

How to get the centroids in DBSCAN sklearn?

I am using DBSCAN for clustering. Now I want to pick a point from each cluster that represents it, but I realized that DBSCAN does not have centroids as k-means does. However, I observed that DBSCAN has something called core points. I am…
EmJ
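One possible way to get a representative per DBSCAN cluster is to take the core point closest to the mean of that cluster's core points. The sketch below assumes the cluster labels and core point indices are already available as plain lists (in scikit-learn they would come from the fitted model's `labels_` and `core_sample_indices_` attributes); `cluster_representatives` and the sample data are hypothetical.

```python
def cluster_representatives(points, labels, core_indices):
    """For each cluster label (noise label -1 skipped), return the index of the
    core point closest to the mean of that cluster's core points."""
    by_label = {}
    for i in core_indices:
        if labels[i] != -1:
            by_label.setdefault(labels[i], []).append(i)
    reps = {}
    for lab, idxs in by_label.items():
        dim = len(points[idxs[0]])
        mean = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Squared Euclidean distance is enough for picking the minimum
        reps[lab] = min(idxs, key=lambda i: sum((points[i][d] - mean[d]) ** 2
                                                for d in range(dim)))
    return reps

# Hypothetical clustering result: two clusters and one noise point
points = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11), (10, 12), (50, 50)]
labels = [0, 0, 0, 1, 1, 1, -1]
core = [0, 1, 2, 3, 4, 5]
reps = cluster_representatives(points, labels, core)
```

Unlike a k-means centroid, the returned representative is always an actual data point, which matters for non-convex clusters where the mean can fall outside the cluster.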
10 votes, 3 answers

R - 'princomp' can only be used with more units than variables

I am using R (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I get the following error when trying to run k-means clustering and plot the result on a graph: "'princomp' can only be used with…
CoolSteve
10 votes, 2 answers

Should we use k-means++ instead of k-means?

The k-means++ algorithm addresses the following two points of the original k-means algorithm: The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k). The approximation…
Karl
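k-means++ differs from plain k-means only in how the initial centers are chosen; the iterations afterwards are the same. The O(log k) figure associated with k-means++ is a bound on the expected approximation ratio of this seeding, not a running time. A minimal sketch of the seeding step follows, with `kmeans_pp_init`, the fixed random seed, and the sample points being assumptions made for illustration.

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers; each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
              for p in points]
        # Already chosen points have weight 0 and cannot be picked again
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical data: two tight groups far apart
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
seeds = kmeans_pp_init(pts, k=2)
```

The sketch assumes `k` does not exceed the number of distinct points; otherwise all weights become zero and `random.choices` raises an error.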
10 votes, 5 answers

How to generate Bad Random Numbers

I'm sure the opposite has been asked many times, but I couldn't find any answers on how to generate bad random numbers. I want to write a small program for cluster analysis and want to generate some random points for testing. If I would just insert…
Nicolas
10 votes, 1 answer

Understanding DynamicTreeCut algorithm for cutting a dendrogram

A dendrogram is a data structure used with hierarchical clustering algorithms that groups clusters at different "heights" of a tree - where the heights correspond to distance measures between clusters. After a dendrogram is created from some input…
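As a point of comparison for adaptive methods such as DynamicTreeCut, the simplest way to turn a dendrogram into clusters is a static cut at one fixed height. The sketch below shows that baseline for single linkage only, where cutting at a height is equivalent to joining all points connected by hops no longer than that height. It is not the DynamicTreeCut algorithm, and `cut_single_linkage` is a hypothetical name.

```python
def cut_single_linkage(points, height):
    """Labels obtained by cutting a single-linkage dendrogram at `height`:
    two points share a cluster if a chain of hops, each <= height, joins them."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if dist <= height:
                parent[find(i)] = find(j)  # merge the two groups
    roots = [find(i) for i in range(len(points))]
    # Renumber groups 0, 1, 2, ... in order of first appearance
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

labels_low = cut_single_linkage([(0,), (1,), (5,), (6,)], height=1.5)
labels_high = cut_single_linkage([(0,), (1,), (5,), (6,)], height=4.5)
```

Raising the cut height merges clusters: the low cut keeps the two pairs apart, while the high cut joins everything into one cluster.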
10 votes, 0 answers

Using precision recall metric on a hierarchy of recovered clusters

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variation of linking methods and other tweaks to the algorithm we want to try out. We will…
10 votes, 5 answers

3D clustering Algorithm

Problem Statement: I have the following problem: There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any…
Teng Lin
10 votes, 4 answers

Using Silhouette Clustering in Spark

I want to use the silhouette score to determine the optimal value of k when using KMeans clustering in Spark. Is there an optimal way to parallelize this, i.e., make it scalable?
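For reference, the quantity being asked about can be written down in a few lines on a single machine. The sketch below is plain Python, not Spark code, and `mean_silhouette` is a hypothetical name; it computes the mean silhouette coefficient with Euclidean distance and makes the scaling problem visible, since every point is compared with every other point.

```python
def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance (O(n^2) pairs)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the rest of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to any other cluster
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(pts, [0, 0, 1, 1])  # well-separated assignment
bad = mean_silhouette(pts, [0, 1, 0, 1])   # deliberately mixed assignment
```

To choose k, one would compute this score for each candidate k and keep the highest; on large data the pairwise step is the part that needs sampling or a distributed approximation.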
10 votes, 1 answer

How to spread out a community graph made with the igraph package in R

I am trying to find communities in tweet data. The cosine similarity between different words forms the adjacency matrix. Then, I created a graph from that adjacency matrix. Visualizing the graph is the task here: # Document Term Matrix dtm =…
magarwal