
I'm trying to learn the k-means clustering algorithm using MATLAB. The problem is that I cannot find any sample data that makes the algorithm easy to understand. I found an example on MathWorks that demonstrates k-means clustering, but unfortunately I cannot understand it. I also tried to understand this simple data set, which I found on Stack Overflow.

Please, I need a basic example of k-means clustering, so that if I implement it in any software (e.g. MATLAB) I can be sure that I am applying it correctly.

Finally, all the data sets on UCI, for example, are too large, and I cannot tell whether my implementation is correct or not.

Thanks in Advance.

Subhi
  • What's wrong with generating your own data? [This example](https://www.mathworks.com/help/stats/kmeans.html#buefthh-2) seems to be pretty useful. Exactly what part of kmeans are you finding confusing? – beaker Jul 24 '17 at 20:47

4 Answers


I know that you are using MATLAB, but R has a number of data sets for testing clustering algorithms, including some that are fairly small. The ruspini data set is a good place to start. These data sets are available as CSV files from GitHub, and MATLAB should be able to read the CSV files. Just search this page for the word cluster.

G5W

We've got a set of data which anyone would say falls into three clusters. We know that the number of clusters will be three, but otherwise we want the software to do the clustering for us.

So start out by picking three objects at random to serve as the cluster centres. Now go through, and assign each object to its nearest cluster centre. The result is three clusters, but rather ugly ones, because it's unlikely we've hit the three actual centroids first time.

So take the mean value of each cluster you have generated, and go through again, assigning the objects to the new cluster centroids. Repeat until the algorithm reaches stability, that is, until the assignments stop changing. The process of taking the mean tends to force the guesses as to the cluster centres towards the actual centres.
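The assign-then-average loop described above can be sketched in a few lines. This is a minimal sketch in plain Python rather than MATLAB, so that it runs without any toolbox. The nine 2-D points are made up for illustration, and the starting centres are fixed (the first, second and fourth points) instead of random so that the run is repeatable; both choices are assumptions, not part of the description above.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Plain k-means on a list of equal-length tuples.

    `centres` holds the initial guesses; returns (centres, labels).
    """
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centres.append(old)  # an empty cluster keeps its old centre
        if new_centres == centres:  # stable: no centre moved
            break
        centres = new_centres
    return centres, labels

# Hypothetical toy data: three visually obvious groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

# Deliberately poor start: two of the three initial centres sit in the same group.
centres, labels = kmeans(points, [points[0], points[1], points[3]])
print(labels)   # cluster index for each point
print(centres)  # final centre of each cluster
```

With this data the first pass produces lopsided clusters, and the second pass already separates the three groups, after which nothing changes. A different starting choice can lead to a different result, which is why the start is normally randomised and repeated.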

It only works if data is actually clustered, however.

Malcolm McLean

The classic iris data set is fine for understanding k-means.

You may even get to see some of the problems of k-means.

Has QUIT--Anony-Mousse

Well,

let k = {2, 3, 4, 10, 11, 12, 20, 25, 30}

That's very simple. Let's split k into two data sets. Pick two starting numbers, one for each set; I took 10 for k1 and 20 for k2, but remember you can choose any numbers. Then put the numbers that are closer to 10 into one data set and the numbers that are closer to 20 into the other.

k1 = {2, 3, 4, 10, 11, 12}, k2 = {20, 25, 30}

So the big data set is split into two according to the nearest starting number. The mean of the first set is the sum of its numbers divided by how many numbers it contains, and the same goes for the second.

(2 + 3 + 4 + 10 + 11 + 12) / 6 = 7
(20 + 25 + 30) / 3 = 25

If you now reassign every number to the nearer of 7 and 25, the two sets stay the same, so no matter how many more iterations you run, the answer will be the same. This is the saturation point (usually called convergence), where the means no longer change. So if you get different numbers, keep recomputing the means until you reach saturation.
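The arithmetic above is small enough to check by machine. The following is a minimal sketch in plain Python (not MATLAB) that uses the same nine numbers and the same starting values, 10 and 20. The variable names are my own, and the loop simply repeats the split-and-average step until the means stop changing.

```python
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
centres = [10.0, 20.0]  # starting guesses taken from the worked example

while True:
    # Split the data: each number joins the group of its nearest centre.
    groups = [[] for _ in centres]
    for x in data:
        nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
        groups[nearest].append(x)
    # Recompute each centre as the mean of its group
    # (an empty group, which does not occur here, would keep its old centre).
    new_centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    if new_centres == centres:  # the means no longer change
        break
    centres = new_centres

print(groups)   # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
print(centres)  # [7.0, 25.0]
```

If your own implementation, given the same data and the same two starting values, ends with means other than 7 and 25, that is a sign that the assignment or the averaging step has a bug.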

Lorena Gomez