4

I have read the documentation of `decision_function` and `score_samples` here, but I could not figure out the difference between these two methods, or which one I should use for an outlier detection algorithm.

Any help would be appreciated.

Anne

3 Answers

5

See the documentation for the attribute offset_:

Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.
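In other words, `decision_function` is just `score_samples` shifted by `offset_`. A minimal check of that relation (my own sketch; the dataset and seed are only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer()["data"]

# The default contamination="auto" fixes offset_ at -0.5
clf = IsolationForest(random_state=0).fit(X)
assert clf.offset_ == -0.5

# decision_function is score_samples shifted by offset_
assert np.allclose(clf.decision_function(X),
                   clf.score_samples(X) - clf.offset_)
```

Note that `score_samples` itself is always negative (values closer to -1 are more anomalous), so `-clf.score_samples(X)` gives scores in (0, 1] where higher means more outlier-like.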

Ben Reiniger
  • So, if I want to have anomaly scores which need to be between 0 and 1, and higher score means more likely to be an outlier, then I should get the scores as `-isof.score_samples(X)`? – panc Sep 16 '22 at 05:33
1

As previously stated in @Ben Reiniger's answer, decision_function = score_samples - offset_. For further clarification:

  • If contamination = 'auto', then offset_ is fixed at -0.5.
  • If contamination is set to anything other than 'auto', then offset_ is no longer fixed; it is set from a percentile of the training scores so that the expected fraction of samples is flagged as outliers.

This can be seen under the fit function in the source code:

def fit(self, X, y=None, sample_weight=None):
    ...
    if self.contamination == "auto":
        # 0.5 plays a special role as described in the original paper.
        # we take the opposite as we consider the opposite of their score.
        self.offset_ = -0.5
        return self

    # else, define offset_ wrt contamination parameter
    self.offset_ = np.percentile(self.score_samples(X),
                                 100. * self.contamination)

Thus, it is important to take note of what contamination is set to, as well as which anomaly scores you are using: score_samples returns what can be thought of as the "raw" scores, since it is unaffected by offset_, whereas decision_function depends on offset_.
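A quick way to see the effect (a sketch; the dataset and seed are only for illustration, and the exact offset value depends on the data):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer()["data"]

auto_clf = IsolationForest(contamination="auto", random_state=0).fit(X)
cont_clf = IsolationForest(contamination=0.1, random_state=0).fit(X)

print(auto_clf.offset_)  # always -0.5
print(cont_clf.offset_)  # the 10th percentile of the training score_samples

# With an explicit contamination, about 10% of the training points end up
# with decision_function < 0, i.e. are treated as outliers
print((cont_clf.decision_function(X) < 0).mean())
```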

0

The User Guide references the paper Isolation Forest by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou.

I did not read the paper, but I think you can use either output to detect outliers. The documentation says score_samples is the opposite of decision_function, so I thought they would be inversely related, but both outputs turn out to have exactly the same relationship with the target. The only difference is that they lie on different ranges; in fact, they even have the same variance, which makes sense given that they differ only by the constant offset_.

To see this, I fit the model to the breast cancer dataset available in sklearn and visualized the average of the target variable grouped by the deciles of each output. As you can see, they both have the exact same relationship.

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

# Load data
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']

# Fit model (y is ignored; IsolationForest is unsupervised)
clf = IsolationForest()
clf.fit(X)

# Split the outputs into deciles to see their relationship with target
t = pd.DataFrame({'target':y,
                  'decision_function':clf.decision_function(X),
                  'score_samples':clf.score_samples(X)})
t['bins_decision_function'] = pd.qcut(t['decision_function'], 10)
t['bins_score_samples'] = pd.qcut(t['score_samples'], 10)

# Visualize relationship
plt.plot(t.groupby('bins_decision_function')['target'].mean().values, lw=3, label='Decision Function')
plt.plot(t.groupby('bins_score_samples')['target'].mean().values, ls='--', label='Score Samples')
plt.legend()
plt.show()

[Plot: mean of the target by decile of each output; the two lines coincide]

Like I said, they even have the same variance:

t[['decision_function','score_samples']].var()
> decision_function    0.003039
> score_samples        0.003039
> dtype: float64

In conclusion, you can use them interchangeably as they both share the same relationship with the target.
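One caveat worth adding (my own sketch, not part of the experiment above): the two outputs rank points identically, but the built-in predict method thresholds decision_function at 0, so the raw score_samples values need the offset_ shift before being compared against 0:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer()["data"]
clf = IsolationForest(random_state=0).fit(X)

# predict() labels outliers as -1 by thresholding decision_function at 0
pred = clf.predict(X)
manual = np.where(clf.decision_function(X) < 0, -1, 1)
assert np.array_equal(pred, manual)

# Applying the same rule to raw score_samples requires the offset_ shift
manual2 = np.where(clf.score_samples(X) - clf.offset_ < 0, -1, 1)
assert np.array_equal(pred, manual2)
```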

Arturo Sbr
  • Thank you for your thorough response. What is t in your code as it says it is not defined. – Anne Jun 21 '21 at 04:42
  • But if you use the contamination value of some number between [0.1, 0.5] instead of 'auto', you would get different results from the mentioned methods – Anne Jun 21 '21 at 04:55
  • Hi Anne. I added `t` to my code. Sorry I forgot to post that bit. Could you expand on your last comment? – Arturo Sbr Jun 21 '21 at 14:09
  • The documentation says that `score_samples` is the opposite of _the original paper's anomaly scores_, not the sklearn `decision_function`. – Ben Reiniger Jun 21 '21 at 14:11
  • I found `score_samples` returns negative values. So I assume the doc is correct. in order to get the original scores I should do `-clf.score_samples(X)` – panc Sep 16 '22 at 05:36