Does ELKI fail for data that contains many duplicate values? I have files with more than 2 million observations (1D), but they contain only a few hundred unique values; the rest are duplicates. When I run such a file through ELKI, for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the most frequent value. I can imagine the LRD calculation is causing this problem if duplicates are taken as nearest neighbours. But shouldn't it NOT be doing this? Can we rely on the results ELKI produces for such cases?

1 Answer
It is not so much a matter of ELKI as of the algorithms.
Most outlier detection algorithms use the k nearest neighbors. If these neighbors are identical points, the resulting values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably reach NaN due to a division by 0 if there are too many duplicates.
But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit these effects. There are some methods to avoid/reduce the effects:
- add jitter to the data set (see the sketch after this list)
- remove duplicates (but never consider highly duplicated values outliers!)
- increase the neighborhood size
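For the first workaround, a minimal sketch (assuming NumPy; the jitter scale of 1e-6 is my choice, not a prescription):

```python
import numpy as np

# Toy 1-d data with heavy duplication, as in the discussion below.
data = np.array([-1.0, 0.0, 0.0, 0.0, 1.0])

# Add tiny Gaussian jitter so duplicates become distinct points and no
# neighborhood collapses to distance 0. Keep the scale far below the
# meaningful resolution of your data.
rng = np.random.default_rng(42)
jittered = data + rng.normal(0.0, 1e-6, size=data.shape)
```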
It is easy to show from the LOF/LoOP equations that such results arise whenever the data contains duplicates.
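To illustrate, here is a from-scratch sketch of the textbook LOF equations (not the ELKI code) on the tiny data set "-1, 0, 0, 0, 1" discussed in the comments below, with k = minPts = 2:

```python
import math

def lof_scores(data, k):
    """Textbook LOF (Breunig et al.), with ties included in the kNN set."""
    n = len(data)
    dist = [[abs(a - b) for b in data] for a in data]
    kdist, neigh = [], []
    for i in range(n):
        ds = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = ds[k - 1][0]                            # k-distance of point i
        kdist.append(kd)
        neigh.append([j for d, j in ds if d <= kd])  # neighbors incl. ties
    lrd = []
    for i in range(n):
        # average reachability distance; 0 for pure duplicates -> lrd = inf
        avg = sum(max(kdist[j], dist[i][j]) for j in neigh[i]) / len(neigh[i])
        lrd.append(math.inf if avg == 0 else 1.0 / avg)
    # LOF = mean neighbor density / own density; inf / inf yields nan
    return [sum(lrd[j] for j in neigh[i]) / len(neigh[i]) / lrd[i]
            for i in range(n)]

print(lof_scores([-1.0, 0.0, 0.0, 0.0, 1.0], k=2))
# [inf, nan, nan, nan, inf]
```

The duplicated points get infinite density and a NaN score, while their neighbors get a score of infinity, which matches the behaviour reported in the question.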
This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to be close to the original publication, so we avoid making unpublished changes. But if a "LOFdup" method is published and contributed to ELKI, we would obviously add it.
Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered; this allows for both optimizations and much more advanced statistics that would be infeasible, or require too many observations, on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive"), with the key benefit that they easily scale to large, multivariate data sets. Sometimes naive methods work very well in practice: just as with naive Bayes, where the independence assumption is questionable yet classification often works well and scales very well, there are some questionable decisions in LOF and LoOP. But they work, and they scale.
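As a hedged sketch of the kernel density suggestion (using SciPy's `gaussian_kde`; the data set and the negative-log-density scoring are my assumptions, not a prescription):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# 1-d data with massive duplication plus two planted gross outliers.
data = np.concatenate([rng.normal(0.0, 1.0, 2000),
                       np.repeat(0.5, 500),      # a heavily duplicated value
                       [8.0, -7.5]])             # planted outliers

kde = gaussian_kde(data)              # bandwidth via Scott's rule by default
scores = -np.log(kde(data))           # low density -> high outlier score
print(data[np.argsort(scores)[-2:]])  # the planted outliers rank highest
```

Unlike LOF/LoOP, the density estimate stays finite here, because the kernel bandwidth smooths over the duplicated value.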
In other words, this is not a bug in ELKI. The implementation does what is published.

- Thanks for replying. Doesn't the kNN algorithm look for nearest neighbours based on distance and not on instance? For example, for an observation `o`, if `k=3` and the first nearest neighbour `n1` has two more duplicates `n1'` and `n1''`, then both these duplicates will also be first nearest neighbours of `o`, not the second and third nearest neighbours. And that is why the size of the neighbourhood may not always be equal to k (section 4, page 3 of "LOF: Identifying Density-Based Local Outliers"). – Ira Sep 16 '15 at 10:22
- In that case, "the neighbors of duplicated points can obtain an outlier score of infinity" should never happen, unless all the data points are just duplicates. Isn't it? – Ira Sep 16 '15 at 10:22
- Duplicates are included, but ties are broken by inclusion in ELKI. So if the nearest neighbor has 5 duplicates, that counts as 5 objects. There is the notion of "k-distinct neighbors", but that is currently not available in ELKI. – Erich Schubert Sep 16 '15 at 13:17
- Infinite outlier scores: consider the data set "-1, 0, 0, 0, 1" and minPts=2. What is the LOF score of each point? The points at 0 have infinite density. – Erich Schubert Sep 16 '15 at 13:18
- Isn't the k-distinct neighbour a critical point made in the algorithm? I have seen the same problem in all R implementations too. The kNN searches return duplicates as distinct neighbours, which imho is not correct, and this would lead to incorrect results. And I agree with you: in the data set {-1, 0, 0, 0, 1} with MinPts=2, 0 doesn't have a second distinct neighbour and hence the infinite LOF. And that is why the choice of MinPts is crucial. – Ira Sep 18 '15 at 06:13
- In my opinion it is only a workaround that won't ultimately help. Doing k-distinct nearest neighbors would be nice, but doing it efficiently in 20 indexes is a lot of work. kNN with a fixed k (+ ties) is easier. Patches to add k-distinct support are welcome, but I don't think it is of high priority. – Erich Schubert Sep 21 '15 at 08:09