Outlier detection in probability/frequency distribution

Question

I have the following two-dimensional dataset. Both X and Y are continuous random variables.

Z = (X, Y) = {(1, 7), (2, 15), (3, 24), (4, 25), (5, 29), (6, 32), (7, 34), (8, 35), (9, 27), (10, 39)}

I want to detect outliers with respect to the values of the y variable. The normal range for y is 10-35, so the first and last pairs in the dataset above are outliers and the others are normal pairs. I want to transform the variable z = (x, y) into a probability/frequency distribution in which the outlier values (the first and last pairs) lie outside one standard deviation. Can anyone help me solve this problem?

PS: I have tried different distances, such as the Euclidean and Mahalanobis distances, but they didn't work.

score 1 · Accepted Answer · answered Dec 06 '13 at 02:33


I'm not exactly sure what your end goal is, but I'm going to assume you store your x and y variables in an nx2 matrix, so z = [x,y], where x and y are nx1 vectors.

So what you are asking for is a way to separate out the data points where y is outside the 10-35 range? For that you can use a conditional statement to find the indexes where that occurs:

index = z(:,2) <= 35 & z(:,2) >= 10;  %This gives vector of 0's & 1's length nx1
z_inliers = z(index,:);      %This has a [x,y] matrix of only inlier data points
z_outliers = z(~index,:);    %This has a [x,y] matrix of outlier data points
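To make this concrete, here is a minimal sketch that applies the same fixed-range filter to the ten pairs from the question. It assumes the data is typed in as a 10x2 matrix named z, as described above; nothing else is introduced.

```matlab
% Data from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];

% Logical mask: true where y lies inside the stated normal range [10, 35]
index = z(:,2) >= 10 & z(:,2) <= 35;

z_inliers  = z(index,:);    % the eight pairs with y inside the range
z_outliers = z(~index,:);   % the pairs (1,7) and (10,39)

disp(z_outliers)
```

With fixed limits the result depends only on the limits themselves, not on the other values in the data set.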

If you want to do this according to standard deviation then instead of 10 and 35 do:

low_range = mean(z(:,2)) - std(z(:,2));
high_range = mean(z(:,2)) + std(z(:,2));
index = z(:,2) <= high_range & z(:,2) >= low_range;  %Uses the y column of z

Then you can plot your PDFs, or whatever else you need, with those points.
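As a rough check of the standard-deviation version, the sketch below runs it on the data from the question. It assumes MATLAB's default std, which normalizes by n-1. Note that for this particular data set the one-standard-deviation band comes out at roughly 17.0 to 36.4, so the pair (2, 15) is flagged as well, even though 15 is inside the 10-35 range.

```matlab
% Data from the question: column 1 is x, column 2 is y
z = [1 7; 2 15; 3 24; 4 25; 5 29; 6 32; 7 34; 8 35; 9 27; 10 39];
y = z(:,2);

mu    = mean(y);   % 26.7
sigma = std(y);    % sample standard deviation (n-1), about 9.67

low_range  = mu - sigma;   % about 17.03
high_range = mu + sigma;   % about 36.37

% Logical mask: true where y is within one standard deviation of the mean
index = y >= low_range & y <= high_range;

z_inliers  = z(index,:);
z_outliers = z(~index,:);  % the pairs (1,7), (2,15) and (10,39)

disp(z_outliers)
```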

answered Dec 06 '13 at 02:33

cjtytler


The second method is closer to what I need, but if I change even one value of the Y variable, the values of low_range and high_range change as well, which gives the wrong final outcome. I need a more stable method that can distinguish between inliers and outliers even when the values in Y change. – mani Dec 06 '13 at 04:29
One standard deviation is a relative value based on a data set, so you have to decide which data set your "outliers" are defined from. If you want an overall collection of data from which the outliers are defined, you'll need to compute low_range and high_range first and store those values; then you can modify the y values, or create a new vector of y values that contains only inliers, as shown above. If memory is not a concern, I would suggest just saving two sets of y: a complete one from which the outlier range is defined, and another with the outliers filtered out. – cjtytler Dec 06 '13 at 16:29
In my case, memory is definitely a concern. What about the second and third standard deviations? I guess they depend on the first, so effectively they also depend on the data set. So my question is: how can a system be trained and tested in this case? What is the best approach? – mani Dec 06 '13 at 23:41
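One way to read cjtytler's suggestion of computing low_range and high_range first and storing them is sketched below. This is only an illustration under stated assumptions: the reference set y_ref is simply the y values from the question, the multiplier k and the vector y_new are hypothetical names, and the values in y_new are made up for the example. Once the two limits are computed, only those two scalars need to be kept, and later changes to the data being tested do not move them.

```matlab
% Reference y values used once to define the band (here: the question's data)
y_ref = [7 15 24 25 29 32 34 35 27 39]';

k = 1;  % hypothetical parameter: number of standard deviations (1, 2 or 3)

% Compute the limits once and store them; they no longer change afterwards
low_range  = mean(y_ref) - k * std(y_ref);   % about 17.03 for k = 1
high_range = mean(y_ref) + k * std(y_ref);   % about 36.37 for k = 1

% Made-up new observations, tested against the stored limits
y_new = [12 20 36 40]';
is_inlier = y_new >= low_range & y_new <= high_range;

disp([y_new is_inlier])
```

Setting k to 2 or 3 widens the band, but the limits still come from the reference set, so which data goes into y_ref remains a choice that has to be made up front.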
