Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.
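As one concrete instance of such grouping, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the sample points, and the choice of the first k points as initial centroids are illustrative assumptions and not part of the tag description. Library implementations typically use smarter seeding and convergence checks.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as initial centroids. This is
    # deterministic but naive; it is used here only to keep the sketch short.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical sample data: two well-separated groups of 2D points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points that end up with the same label are "similar" in the sense of being closest to the same centroid; other notions of similarity lead to other algorithms.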

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, and correlation clustering.

Related topics: knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: Questions that do not directly concern implementation are probably off-topic here; consider posting them on Cross Validated, Data Science, or Artificial Intelligence instead. Please choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?"

6244 questions
2
votes
2 answers

How do I view the datapoints that are added to a cluster after applying K-Means algorithm?

I have implemented the k-means algorithm in Scala as follows. def clustering(clustnum:Int,iternum:Int,parsedData: RDD[org.apache.spark.mllib.linalg.Vector]): Unit= { val clusters = KMeans.train(parsedData, clustnum, iternum) println("The Cluster…
2
votes
1 answer

Adjusted Mutual Information (scikit-learn)

I have implemented a clustering algorithm for summarizing log files, and am currently testing it against ground-truth data with the Adjusted Rand index and the Adjusted Mutual Information index. Input to my algorithm is a list of log entries, and…
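For readers unfamiliar with the first of these two measures, the following is a minimal sketch of the Adjusted Rand index computed from pair counts, using only the standard library. The function name `adjusted_rand_index` and the toy labelings are assumptions for illustration; this is not the asker's code, and it is not scikit-learn's implementation.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two flat labelings of the same items."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:  # fewer than two items: nothing to compare
        return 1.0
    # Pairs of items that share a cell of the contingency table, a row
    # (same ground-truth class), or a column (same predicted cluster).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate: both all-singletons or both one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy labelings: the prediction splits the second true class in two.
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

The index is 1 for identical partitions (regardless of how the labels are named) and is close to 0 for labelings that agree no more than chance would predict.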
2
votes
2 answers

Clustering longitude and latitude gps data

I have more than 400 thousand car GPS locations, like: [ 25.41452217, 37.94879532], [ 25.33231735, 37.93455887], [ 25.44327736, 37.96868896], ... I need to perform spatial clustering with a distance between points of <= 3 meters. I tried to use…
M. Smith
  • 21
  • 1
  • 2
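A distance threshold in meters cannot be applied directly to raw degree values, so a great-circle distance is usually the first building block for a question like this. The sketch below uses the haversine formula with a spherical Earth; the function name `haversine_m`, the radius constant, and the sample coordinates are assumptions for illustration. The excerpt does not say whether its pairs are (latitude, longitude) or (longitude, latitude), so the arguments are named explicitly.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical pair of points about 1.1 m apart (0.00001 degrees of latitude).
d = haversine_m(37.0, 25.0, 37.00001, 25.0)
print(d <= 3.0)
```

A density-based or threshold-based clustering method can then use this function as its distance measure, although at 400 thousand points a spatial index is needed to avoid comparing every pair.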
2
votes
1 answer

Python K means clustering

I am trying to implement the code on this website to estimate what value of K I should use for my K means clustering. https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/ However I am not getting any success…
piccolo
  • 2,093
  • 3
  • 24
  • 56
2
votes
1 answer

Finding minimum number of required 'central points'

I have a set of 'n' nodes. A function returns a kind of distance between two nodes such that dist(a,c) may not be dist(a,b)+dist(b,c). Based on a threshold I connect certain nodes via edges. I wish to select the minimum number of nodes such that the…
2
votes
1 answer

Printing principal features in clusters (python)

I have an m x n matrix, with m features and n samples. The matrix is called term_individual. The clustering is done using scikit-learn: from sklearn.cluster import KMeans kmeans = KMeans(n_clusters=n_clusters) kmeans.fit(term_individual.T) centroids =…
Vladimir Vargas
  • 1,744
  • 4
  • 24
  • 48
2
votes
1 answer

Cluster groups of face images

I have extracted faces from a video and I clustered them in big groups (each group contains faces from the same person, I did this using change of background detection). Now I want to cluster those groups into a smaller number of groups and to have,…
N. Ruchers
  • 147
  • 1
  • 11
2
votes
1 answer

How to calculate BCubed precision and recall

According to this published page, BCubed precision and recall, and thus the F1-measure calculation, is the best technique for evaluating clustering performance. See Amigó, Enrique, et al. "A comparison of extrinsic clustering evaluation metrics based on…
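As a rough illustration of what is being asked, here is a minimal sketch of BCubed precision and recall for non-overlapping clusters, under the usual definition in which each item's score is averaged over all items and an item counts as sharing a cluster and a category with itself. The function name `bcubed`, the toy labels, and the use of the harmonic mean for F1 are assumptions; the input lists are assumed to be non-empty and of equal length.

```python
from collections import Counter

def bcubed(clusters, categories):
    """BCubed precision, recall and F1 for flat, non-overlapping labelings.

    clusters[i] is the predicted cluster of item i and categories[i] is its
    ground-truth category.
    """
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    category_sizes = Counter(categories)
    # overlap[(c, g)]: number of items in cluster c that have category g.
    overlap = Counter(zip(clusters, categories))
    # Per-item precision: share of the item's cluster with the same category.
    precision = sum(overlap[(c, g)] / cluster_sizes[c]
                    for c, g in zip(clusters, categories)) / n
    # Per-item recall: share of the item's category found in the same cluster.
    recall = sum(overlap[(c, g)] / category_sizes[g]
                 for c, g in zip(clusters, categories)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: four items, two predicted clusters, two true categories.
precision, recall, f1 = bcubed([0, 0, 1, 1], [0, 0, 0, 1])
print(precision, recall, f1)
```

Because every item matches itself, both averages are strictly positive, so the F1 denominator cannot be zero for non-empty input.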
2
votes
2 answers

Clustering 2D points using sklearn KDTree

I have an array of (n_sample x 2) and I want to cluster them using KDTree in sklearn.neighbors.KDTree. I have this sample piece of code: from sklearn.neighbors import KDTree import numpy as np np.random.seed(0) X = np.random.random((10, 2)) tree =…
Ash
  • 3,428
  • 1
  • 34
  • 44
2
votes
2 answers

How may I calculate Accuracy in NLTK KMeans Clustering

I am trying to use NLTK's KMeans Clustering Algorithm. It is generally going fine. I want to use the Metrics package of NLTK to determine precision, recall and F-measure. I searched for some examples on the web and in other references but may be…
Coeus2016
  • 355
  • 4
  • 14
2
votes
1 answer

How to cluster a Time Series using DBSCAN python

So I have my data in the form of, X = [[T1],[T2]..] where Tn is the time series of the nth user. I want to cluster these time series using the DBSCAN method using the scikit-learn library in Python. When I try to directly fit the data, I get the output…
Siddharth Shah
  • 113
  • 4
  • 11
2
votes
0 answers

Multiple Regression - cannot allocate vector of size 4.7gb

First of all I wanna say that I have no clue about R and coding itself. I just have to do a regression with clustered standard errors for my bachelor thesis and I can't do that in Excel. I managed to do the linear regression with clustered standard…
2
votes
1 answer

Clustering 1000 images to find group of images with greater similarity

I have 1000 2D gray-scale images and would like to cluster them in Python so that images with more similarities stay in the same group. The images represent simple geometrical shapes including circles, triangles, etc. If I want to flatten each…
S PA
  • 103
  • 8
2
votes
3 answers

Issue with nested calls with psexec (access denied)

First of all, sorry for my poor English. I will try to explain my problem. I am using psexec within a script to restart a cluster as follows: script1 in node1: perform a lot of tasks (shutdown services, check status, etc..) in the node1 and after…
user41931
  • 31
  • 1
  • 4
2
votes
2 answers

How to cluster evolving data streams

I want to incrementally cluster text documents reading them as data streams but there seems to be a problem. Most of the term weighting options are based on vector space model using TF-IDF as the weight of a feature. However, in our case IDF of an…