
I am getting NaN values as decision scores when using the Angle-Based Outlier Detector (ABOD), because of which the outliers are not detected.

import numpy as np
from pyod.models.abod import ABOD
from sklearn.preprocessing import MinMaxScaler

def outlier_ABOD(data, outliers_fraction=0.1):
    data = np.array([data]).reshape(-1, 1)

    scaler = MinMaxScaler(feature_range=(0, 1))
    data = scaler.fit_transform(data)

    clf = ABOD(contamination=outliers_fraction)
    clf.fit(data)
    y_pred = clf.predict(data)

    print(clf.decision_scores_)

    return np.where(y_pred)[0]

X1 = np.array([1,1,3,2,1,2,1,2,3,2,1,88,1234,8888,1,2,3,2])
outliers = outlier_ABOD(X1, 0.1)

OUTPUT:

Decision Scores: [            nan             nan -0.00000000e+00             nan
             nan             nan             nan             nan
 -0.00000000e+00             nan             nan -5.77145973e+03
 -3.60509466e+00 -6.08142776e-03             nan             nan
 -0.00000000e+00             nan]

Outliers: array([], dtype=int64)

So, if you look at the output, there are some NaN values, because of which clf.threshold_ is also NaN. Hence clf cannot detect outliers with the clf.predict method: clf.predict() returns all zeros, indicating there are no outliers, when in fact there are. How can I prevent this?

EDIT: When I take a different value for X1

X1 = np.array([3,2,1,88,9,7, 90, 1, 2, 3, 1, 98, 8888])
outliers = outlier_ABOD(X1, 0.1)

The output displayed is

Decision scores: [-3.14048147e+14 -5.54457418e+15 -3.46535886e+14 -1.58233289e+12
 -4.38660405e+12 -4.02831074e+13 -2.36040501e+12 -3.46535886e+14
 -5.54457418e+15 -3.14048147e+14 -3.46535886e+14 -7.76901896e+10
 -3.35886302e-05]

Outliers: array([   1,    1,    1,   98, 8888])

So, for the first X1 there are NaNs in the decision scores and hence no outliers can be produced, while for the second X1 there are no NaNs in the decision scores and the outliers are found. Now, I cannot understand why some X1 values give NaN outputs and others do not.

3 Answers


For some reason, I do not feel ABOD works in your case, as all the scores are NaN or zeros (close to 0). I suspect there are some other issues besides the NaNs. Have you tried other models as well, e.g., Isolation Forest?
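For reference, here is what an Isolation Forest baseline might look like on the same data. This is a sketch using scikit-learn's IsolationForest (which pyod's IForest wraps); the contamination value is an assumption, not something from the question:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X1 = np.array([1, 1, 3, 2, 1, 2, 1, 2, 3, 2, 1,
               88, 1234, 8888, 1, 2, 3, 2]).reshape(-1, 1)

# Isolation Forest does not rely on angles between points,
# so duplicated values cannot produce NaN scores the way
# ABOD's fast mode can.
clf = IsolationForest(contamination=0.2, random_state=42)
clf.fit(X1)
pred = clf.predict(X1)            # -1 = outlier, 1 = inlier

outliers = np.where(pred == -1)[0]
print(outliers)                   # the extreme values should be flagged
```

No scaling step is needed here, since tree-based splits are invariant to monotone rescaling of a single feature.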

Sorry, I do not have enough points to post a comment.

Yue Zhao
  • Hey, check out the EDIT part of my question. I did not try other models, but this model has issues with some specific inputs. For some values it gives NaNs in the decision scores and for other values it does not. I could not understand the reason behind that. – Surya Prakash Reddy May 09 '19 at 07:23
  • Hi, Yue Zhao. Nice to see you everywhere: Zhihu, GitHub and Stack Overflow. Looks like PyOD, your son, is so hot. – Yong Wang May 10 '19 at 08:02
  • Hey Yong. Nice to see you here as well. Sorry I do not have enough points to upvote your answer. Thanks for probing it, as I am caught up in so many things :( – Yue Zhao May 10 '19 at 19:33

I reproduced the same result on my computer, got the same error, and solved it.

In your case, the answer is: do not use the 'fast' method; choose 'default'.
Recently I did a general outlier-detection integration project, so I went through a number of multi-dimensional and high-dimensional outlier detection algorithms. Isolation Forest is my favorite one: acceptable accuracy and almost the fastest speed. ABOD and other neighbour-based algorithms are too complicated and slow to use. ABOD and others do have tricks like fast mode, but those are based on specific assumptions.
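As a side note on why fast mode can produce NaNs at all (my own reading, worth verifying against the ABOD paper): the fast variant scores a point from angles between difference vectors to its nearest neighbours. The first X1 contains many duplicated values, so some neighbour difference vectors are exactly zero, and the cosine of the angle becomes 0/0 = NaN. Whether this happens depends on how many duplicates fall inside a point's neighbour set, which would explain why only some inputs are affected. A minimal NumPy sketch of that failure mode (not pyod's actual implementation):

```python
import numpy as np

# Three 1-D points after scaling; b is a duplicate of a.
a, b, c = np.array([0.0]), np.array([0.0]), np.array([1.0])

ab = b - a   # zero vector, because a and b are duplicates
ac = c - a

# cos(angle) = <ab, ac> / (|ab| * |ac|) -> 0/0 when |ab| == 0
with np.errstate(invalid="ignore"):
    cos_angle = (ab @ ac) / (np.linalg.norm(ab) * np.linalg.norm(ac))

print(cos_angle)  # nan: the duplicated point poisons the score
```

The 'default' method aggregates over all point triples, so a few degenerate angles are averaged with many valid ones instead of dominating a small neighbour set.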

import numpy as np
from pyod.models.abod import ABOD
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

def outlier_ABOD(data, outliers_fraction=0.1):
    data = np.array([data]).reshape(-1, 1)

    scaler = MinMaxScaler(feature_range=(0,1))
    #scaler = StandardScaler()
    data = scaler.fit_transform(data)

    clf = ABOD(contamination=outliers_fraction, method='default')
    clf.fit(data)
    y_pred = clf.predict(data)

    print(clf.decision_scores_)

    return np.where(y_pred)[0]

X1 = np.array([1,1,3,2,1,2,1,2,3,2,1,88,1234,8888,1,2,3,2])
X2 = np.array([3,2,1,88,9,7, 90, 1, 2, 3, 1, 98, 8888])
X1_outliers = outlier_ABOD(X1, 0.1)
X2_outliers = outlier_ABOD(X2, 0.1)
print(X1_outliers,X2_outliers)




[ -9.76962477e+14  -9.76962477e+14  -7.22132612e+14  -3.40246589e+15
  -9.76962477e+14  -3.40246589e+15  -9.76962477e+14  -3.40246589e+15
  -7.22132612e+14  -3.40246589e+15  -9.76962477e+14  -2.15972387e+07
  -3.86731597e+02  -2.68433994e-03  -9.76962477e+14  -3.40246589e+15
  -7.22132612e+14  -3.40246589e+15]
[ -3.11767543e+14  -1.15742730e+15  -2.45343660e+14  -2.67101787e+11
  -3.15072697e+12  -1.01170976e+13  -3.98826857e+11  -2.45343660e+14
  -1.15742730e+15  -3.11767543e+14  -2.45343660e+14  -1.51894970e+10
  -3.51433434e-05]
[12 13] [11 12]
Yong Wang
  • Thank you for your answer. I have tried using 'default', but the issue is that its time complexity is O(n^3). I have 10,000 rows in my dataset and it would take a huge amount of time, nearly 10,000 seconds, which should definitely be avoided; that is the reason I am trying the 'fast' method. Isn't there any solution for that? Also, you said "they are based on specific assumption." Can you tell me what those assumptions are? – Surya Prakash Reddy May 10 '19 at 06:52
  • Well, regarding the assumptions, you can read the paper directly. Regarding the performance, why not try Isolation Forest? – Yong Wang May 10 '19 at 07:42
  • It is better that you read the paper on the algorithm directly or check with the pyod author Yue Zhao. I read the paper and gave up on ABOD for that reason, but I cannot find the paper now. – Yong Wang May 10 '19 at 08:00
  • @YongWang May I draw your attention to my [question](https://stackoverflow.com/questions/63175802/how-can-generate-impulse-as-outliers-on-periodic-or-sequenced-based-data-for-doi) since you are into outlier detection concept. – Mario Jul 30 '20 at 15:09

I had the same problem while using pyod's ECOD. Try removing columns with a single value before calling fit:

feats = []
for col in data.columns:
    if data[col].nunique() == 1:  # column holds a single distinct value
        feats.append(col)
data = data.drop(feats, axis=1)  # drop single-value features
print(feats)

Hope that it works for you :)
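The same idea can be written in one line with pandas' nunique. A small self-contained sketch (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [7, 7, 7],      # constant column
                   "c": [0.1, 0.2, 0.3]})

# Keep only columns with more than one distinct value; constant
# columns carry no information and can break scaling or scoring
# steps downstream (e.g. producing NaN after normalization).
df = df.loc[:, df.nunique() > 1]
print(list(df.columns))  # ['a', 'c']
```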