Questions tagged [k-means]

k-means is a clustering algorithm, implemented in popular data science tools. Use this tag for questions related to the k-means clustering algorithm itself, or to its use with the tools that implement it (alongside other tags specific to those tools).

In statistics and data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean, so that the within-cluster sum of squared deviations is minimized.

For details, see the Wikipedia entry at http://en.wikipedia.org/wiki/K-means_clustering
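The partitioning described above is most often computed with Lloyd's iteration: assign each observation to its nearest mean, recompute the means, and repeat until the assignments stop changing. The following is a minimal pure-Python sketch, not any particular library's implementation. The function names and the choice to seed the centroids with the first k points are assumptions made here to keep the example deterministic; real tools typically use random or k-means++ initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    """Lloyd's iteration. Assumption: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means will not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Lloyd's iteration only finds a local optimum, so results generally depend on the initial centroids; this is why libraries usually restart from several initializations and keep the best run.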

3514 questions
34
votes
1 answer

Cluster one-dimensional data optimally?

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the optimal way to do k-means clustering in one dimension?
Laciel
31
votes
3 answers

Understanding "score" returned by scikit-learn KMeans

I applied clustering on a set of text documents (about 100). I converted them to Tfidf vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when…
Prateek Dewan
31
votes
2 answers

Scikit-learn: How to run KMeans on a one-dimensional array?

I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array…
Irene
31
votes
4 answers

What is the difference between the "k-means" and "fuzzy c-means" objective functions?

I am trying to see whether the performance of the two can be compared based on the objective functions they work on.
n0ob
30
votes
1 answer

Online k-means clustering

Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with…
Theodor
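The one-point-at-a-time processing described in this question can be sketched with a sequential (MacQueen-style) update, where the nearest centroid is moved toward each arriving point by a step of 1/count. This is only an illustration under stated assumptions, not the asker's code: the class name `OnlineKMeans`, the method name `partial_fit`, and the decision to count each initial centroid as one observation are all hypothetical choices made here.

```python
class OnlineKMeans:
    """Sequential k-means: update one centroid per incoming point."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Assumption: each initial centroid counts as one observation.
        self.counts = [1] * len(self.centroids)

    def partial_fit(self, point):
        """Assign `point` to its nearest centroid and update that centroid."""
        j = min(
            range(len(self.centroids)),
            key=lambda i: sum(
                (x - c) ** 2 for x, c in zip(point, self.centroids[i])
            ),
        )
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]
        # Running-mean update: c <- c + (x - c) / count
        self.centroids[j] = [
            c + rate * (x - c) for c, x in zip(self.centroids[j], point)
        ]
        return j


model = OnlineKMeans([(0.0,), (10.0,)])
stream = [(1.0,), (9.0,), (2.0,), (11.0,)]
labels = [model.partial_fit(p) for p in stream]
print(labels)
print(model.centroids)
```

Each update costs O(k) distance computations and needs no stored history, which is the property the question is after; the trade-off is that the result depends on the order in which points arrive.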
28
votes
5 answers

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following error: Error in do_one(nmeth) : NA/NaN/Inf in…
Jonathan Rhein
26
votes
6 answers

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a specified radius). That means that there probably has…
26
votes
1 answer

Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a list of documents as shown below: documents =…
26
votes
2 answers

Estimation of number of Clusters via gap statistics and prediction strength

I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into Python scripts to estimate the number of clusters in the iris data, which has 3 clusters. Instead…
Riyaz
26
votes
2 answers

What is the time complexity of k-means?

I was going through the k-means Wikipedia page. Based on the algorithm, I think the complexity is O(n*k*i) (n = total elements, k = number of clusters, i = number of iterations). So can someone explain this statement from Wikipedia to me, and how is this NP-hard? If k…
parallel
25
votes
2 answers

Group n points in k clusters of equal size

Possible Duplicate: K-means algorithm variation with equal cluster size EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that covers this one:…
Pierre-David Belanger
25
votes
2 answers

K-Means: Lloyd, Forgy, MacQueen, Hartigan-Wong

I'm working with the K-Means algorithm in R and I want to figure out the differences between the four algorithms (Lloyd, Forgy, MacQueen, and Hartigan-Wong) that are available for the function "kmeans" in the stats package. However, I was not able to get a…
user2974776
24
votes
4 answers

Clustering results change on each run in Python scikit-learn

I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and got the results with no problem. But every time I run it, I get different results. I know this is a problem with initialization but I…
user3430235
23
votes
3 answers

Using K-means with cosine similarity - Python

I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric. I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance…
ise372
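One common workaround for the cosine-distance setup in this question rests on an identity: for vectors scaled to unit length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so ordinary Euclidean k-means on normalized vectors ranks neighbors the same way cosine similarity does. The sketch below only checks that identity numerically; the helper names are assumptions, and it does not address how centroids should be renormalized during clustering.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


u = [3.0, 4.0]
v = [4.0, 3.0]

cos_sim = cosine_similarity(u, v)
sq_dist = squared_euclidean(normalize(u), normalize(v))

# For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v))
print(cos_sim, sq_dist, 2 * (1 - cos_sim))
```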
23
votes
3 answers

kmeans scatter plot: plot different colors per cluster

I am trying to make a scatter plot of k-means output that clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a particular color. sentence_list=["Hi how are you", "Good morning" ...] #i…
jxn