
I have a number of small data sets, each containing 10 XY coordinates. I am using MATLAB (R2012a) and k-means to obtain a centroid. Some of the clusters (see figure below) contain extreme points, and because my data sets are so small, a single outlier distorts the centroid. Is there an easy way to exclude these points? Supposedly MATLAB has an 'exclude outliers' function, but I can't find it anywhere in the tool menu. Thank you for your help! (And yes, I am new to this :-)

[Figure: image not preserved]

Has QUIT--Anony-Mousse
carro
    I think the word you are looking for is Outlier (http://en.wikipedia.org/wiki/Outlier), not Outliner. Maybe this will help you find a solution more easily. – Medo42 Dec 21 '12 at 11:38
    It would also be a good idea to mention what software you're using and tag accordingly. – kotekzot Dec 21 '12 at 11:41
    Sorry, I wrote "outliners" accidentally ;( I have done a fair bit of searching, but I am a beginner in MATLAB and the code I have encountered so far is pretty heavy. I looked at ORC and ODIN, but according to the MATLAB help there should be something called 'exclude outliers' in the toolbar, and I can't find it. I am using R2012a. – carro Dec 21 '12 at 11:46

2 Answers


k-means can be quite sensitive to outliers in your data set. The reason is that k-means minimizes the sum of squared distances to the cluster centroids, so a large deviation (such as that of an outlier) gets a disproportionately large weight.
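To make the effect concrete, here is a small sketch with made-up numbers (the ten points below are hypothetical, not the asker's data). It treats the ten points as a single cluster, so the k-means centroid is simply the mean, and it reports how much of the sum of squares comes from the one extreme point.

```matlab
% Hypothetical cluster of 10 XY points; the last row is the outlier.
X = [1 1; 1 2; 2 1; 2 2; 1.5 1.5; 1 1.5; 2 1.5; 1.5 1; 1.5 2; 15 15];

% With a single cluster, the k-means centroid is the coordinate-wise mean.
centroidAll   = mean(X, 1);          % centroid including the outlier
centroidClean = mean(X(1:9, :), 1);  % centroid without the outlier

% Squared distance of every point to the centroid computed from all points.
diffs  = X - repmat(centroidAll, size(X, 1), 1);
sqDist = sum(diffs .^ 2, 2);

% Fraction of the total sum of squares contributed by the outlier alone.
outlierShare = sqDist(end) / sum(sqDist);

fprintf('centroid with outlier:    [%.2f %.2f]\n', centroidAll);
fprintf('centroid without outlier: [%.2f %.2f]\n', centroidClean);
fprintf('outlier share of sum of squares: %.2f\n', outlierShare);
```

With only ten points per data set, one extreme point is enough to pull the centroid well outside the main group.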

If you have a noisy data set with outliers, you might be better off using an algorithm that has specialized noise handling, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Note the "N" in the acronym: Noise. Unlike k-means and many other clustering algorithms, DBSCAN can decide not to cluster objects that lie in regions of low density.
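As a rough illustration of the density idea, the sketch below computes only the set of points that DBSCAN would label as noise. It is not a full DBSCAN implementation, since it does not assign cluster labels. The data and the parameters `epsilon` and `minPts` are assumptions chosen for this example and would need tuning for real data. No toolbox functions are used.

```matlab
% Hypothetical data: a tight group of 10 points plus one far-away point.
X = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.3; 0.1 0.4; ...
     0.4 0.1; 0.2 0.3; 0.3 0.0; 0.0 0.3; 0.4 0.4; ...
     5.0 5.0];

epsilon = 0.5;  % neighborhood radius (assumed value)
minPts  = 3;    % minimum neighbors, counting the point itself (assumed value)

% Pairwise Euclidean distances, computed row by row.
n = size(X, 1);
D = zeros(n);
for i = 1:n
    diffs   = X - repmat(X(i, :), n, 1);
    D(i, :) = sqrt(sum(diffs .^ 2, 2))';
end

isNeighbor = D <= epsilon;                 % each point is its own neighbor
isCore     = sum(isNeighbor, 2) >= minPts; % dense enough to be a core point
nearCore   = any(isNeighbor(:, isCore), 2);% within epsilon of some core point

% Noise: neither a core point nor within epsilon of a core point.
isNoise = ~isCore & ~nearCore;

% Centroid of the remaining points.
centroid = mean(X(~isNoise, :), 1);
```

Here only the point at (5, 5) ends up flagged as noise, and the centroid is computed from the other ten points.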

Erich Schubert

You're looking for something like "outlier removal", and as others have linked to above, "there is no rigorous mathematical definition of what constitutes an outlier" - http://en.wikipedia.org/wiki/Outlier#Identifying_outliers.

Outlier detection is even more difficult when you're doing unsupervised clustering, since you're trying to learn both what the clusters are and which data points belong to no cluster.

One simple definition is to treat any data point that is "far" from every other data point as an outlier. E.g., you might consider removing the point whose distance to its nearest neighbor is largest:

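x = randn(100,2);
x(101,:) = [10 10];  % a clear outlier
nSamples = size(x,1);

% pairwise Euclidean distances (pdist and squareform need the Statistics Toolbox)
pointToPointDistVec = pdist(x);
pointToPointDist = squareform(pointToPointDistVec);
pointToPointDist = pointToPointDist + diag(inf(nSamples,1)); % remove self-distances; set to inf

% distance from each point to its nearest neighbor
smallestDist = min(pointToPointDist,[],2);
% the point with the largest nearest-neighbor distance is the outlier candidate
[maxSmallestDist,outlierInd] = max(smallestDist);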

You can repeat the above a few times to remove points one at a time. Note that this will not remove outliers that happen to have at least one nearby neighbor. If you read the Wikipedia page and see an algorithm that might be more helpful, try to implement it and ask about that specific approach.
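A minimal sketch of that repetition follows, under a few assumptions: the data are made up, `maxRemovals` and `distThreshold` are hypothetical stopping rules that are not part of the approach above, and the distances are computed with plain loops so that no toolbox is needed.

```matlab
% Hypothetical data: 10 clustered points plus two isolated points.
x = [0.0 0.0; 0.1 0.2; 0.2 0.1; 0.3 0.3; 0.1 0.4; ...
     0.4 0.1; 0.2 0.3; 0.3 0.0; 0.0 0.3; 0.4 0.4; ...
     5.0 5.0; -4.0 6.0];

maxRemovals   = 3;    % assumed cap on how many points may be dropped
distThreshold = 1.0;  % assumed: stop once every point has a neighbor this close

keep = true(size(x, 1), 1);   % logical mask of points still in the data set
for iter = 1:maxRemovals
    idx = find(keep);
    pts = x(idx, :);
    n   = size(pts, 1);
    if n < 3
        break;                % too few points left to judge
    end

    % Pairwise distances; the diagonal stays at inf to ignore self-distances.
    D = inf(n);
    for i = 1:n
        for j = [1:i-1, i+1:n]
            D(i, j) = norm(pts(i, :) - pts(j, :));
        end
    end

    % Point whose nearest neighbor is farthest away.
    [maxSmallestDist, worst] = max(min(D, [], 2));
    if maxSmallestDist <= distThreshold
        break;                % nothing isolated enough remains
    end
    keep(idx(worst)) = false; % drop that point and repeat
end

xClean = x(keep, :);
```

In this example the two isolated points are removed in the first two passes, and the third pass stops because every remaining point has a close neighbor. The caveat still applies: two outliers that sit next to each other would both survive.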

Pete