
Edit: please share comments as I'm learning to post good questions

I'm trying to fit IsolationForest() on this dataset and then apply the trained model to a second dataset whose quality values have been altered, so that I can predict the quality values and fetch all wines with quality 8 and 9.

However, I'm running into a problem: the classification report shows an accuracy of 0.0:

print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0

    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0

I don't know whether it's a hyperparameter issue, whether I'm cleaning the data incorrectly, or whether I'm passing the wrong parameters. I have already tried both with and without SMOTE. I would like to reach an accuracy of at least 90%.

I'll leave the shared drive link public for dataset verification:

https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing

Here's my code:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix

df = pd.read_csv('wines.csv')

df.head(5)

ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])

df.info()

df['color'] = df['color'].astype(int)

df.head(3)

stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality',axis=1)
y_smote = df['quality']
x_smote,y_smote = stm.fit_resample(x_smote,y_smote)

print(x_smote.shape,y_smote.shape)

x_smote.columns

scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote

X.shape, y.shape

x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

from sklearn.metrics import hamming_loss

iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10, max_features=1.0, bootstrap=False, n_jobs=-1, 
                            random_state=None, verbose=0, warm_start=False)

iforest_fit = iforest.fit(x_train,y_train)

prediction = iforest_fit.predict(x_test)

print(prediction.shape, y_test.shape)

y.value_counts()

prediction

print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)

print(classification_report(y_test, prediction))

1 Answer


May I ask why you chose Isolation Forest as your model? This article says that Isolation Forest is an unsupervised learning algorithm for anomaly detection.

When I print some samples of the predictions (from Isolation Forest) and of the ground truth, I get the following results, which show why the accuracy is 0.0: the model only ever outputs 1 or -1, never a quality label from 3 to 9:

print(list(prediction[0:15]))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(list(y_test[0:15]))
[9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5]
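The same behaviour can be reproduced without wines.csv. What follows is only a minimal sketch on synthetic data; the names `X_demo`, `y_demo` and `iforest_demo` are made up for illustration and are not part of the question's code. It shows that the labels passed to `fit()` are ignored and that `predict()` returns only `1` (inlier) or `-1` (outlier), so the predictions can never coincide with quality labels 3 to 9.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the wine features and quality labels (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.integers(3, 10, size=500)  # integer labels in 3..9

iforest_demo = IsolationForest(n_estimators=50, contamination=0.10, random_state=0)
# y is accepted only for API consistency; the estimator does not use it.
iforest_demo.fit(X_demo, y_demo)

pred_demo = iforest_demo.predict(X_demo)

# Distinct values produced by the model versus distinct true labels.
labels_seen = set(np.unique(pred_demo).tolist())
overlap = labels_seen & set(np.unique(y_demo).tolist())

print("predicted values:", labels_seen)
print("values shared with the true labels:", overlap)
```

Because the two label sets have nothing in common, every row of the classification report is necessarily 0.00.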

The wines.csv dataset and your code both point to a multi-class classification problem, so I have used RandomForestClassifier() for the second part of your code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss

model = RandomForestClassifier()
model.fit(x_train,y_train)
prediction = model.predict(x_test)

print(prediction[0:15])    #see 15 samples of prediction
[3, 9, 5, 5, 7, 9, 7, 6, 9, 8, 5, 9, 8, 3, 3]

print(list(y_test[0:15]))    #see 15 samples of actual truth
[3, 9, 5, 6, 6, 9, 7, 5, 9, 8, 5, 9, 8, 3, 3]

print(confusion_matrix(y_test, prediction))
[[842   0   0   0   0   0   0]
 [  2 815  17   8   1   1   0]
 [  8  50 690 130  26   2   0]
 [  2  28 152 531 128  16   0]
 [  4   1  15  66 716  32   3]
 [  0   1   0   4  12 833   0]
 [  0   0   0   0   0   0 820]]

print('hamming_loss =', hamming_loss(y_test, prediction))
hamming_loss = 0.11903962390866353

print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       842
           4       0.91      0.97      0.94       844
           5       0.79      0.76      0.78       906
           6       0.72      0.62      0.67       857
           7       0.81      0.86      0.83       837
           8       0.94      0.98      0.96       850
           9       1.00      1.00      1.00       820

    accuracy                           0.88      5956
   macro avg       0.88      0.88      0.88      5956
weighted avg       0.88      0.88      0.88      5956

The accuracy is already 0.88 even before tuning hyperparameters.
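Once a classifier produces quality predictions, fetching the wines with quality 8 and 9 is a matter of masking. This is a rough sketch only: `new_wines`, its column names and `predicted_quality` are hypothetical stand-ins for the second dataset and for the output of `model.predict()` on it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the second dataset (column names are assumptions).
new_wines = pd.DataFrame({
    "alcohol": [9.5, 12.8, 11.0, 13.1],
    "pH": [3.2, 3.4, 3.1, 3.3],
})
# Hypothetical stand-in for model.predict(...) on that dataset.
predicted_quality = np.array([5, 9, 6, 8])

# Keep only the rows whose predicted quality is 8 or 9.
top_mask = np.isin(predicted_quality, [8, 9])
top_wines = new_wines.loc[top_mask].assign(
    predicted_quality=predicted_quality[top_mask]
)

print(top_wines)
```

Note that the second dataset would need the same feature columns, and the same fitted scaler, as the training data for the predictions to be meaningful.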

blackraven
  • Hi perpetualstudent, the exercise I'm doing asks me to use Isolation Forest to train on this dataset and then use it on another dataset with altered qualities, to predict the quality values and fetch all wines with quality 8 and 9. That's why I find it strange that everything gives 0. I've already tried at least 3 different ways to get a result in training. Will I only be able to get a result by working on the altered dataset? – Gabriel Rodrigues Aug 24 '22 at 14:50
  • I see. Would you try the base model `iforest = IsolationForest()` and see what you get? My conda just crashed, and I'll have to reinstall tomorrow (it's 11pm now) – blackraven Aug 24 '22 at 15:02
  • I'll try that now, I'll try using the other dataset together too, do you want me to upload the changed dataset here for you to take a look too? – Gabriel Rodrigues Aug 24 '22 at 15:06
  • sure.. you could use the same google drive – blackraven Aug 24 '22 at 15:10
  • I've already uploaded it there; there should now be 2 files, wines and wines_hacked. I tried changing the dataframe a little and recoded wines with a grade of 8 or more as one number and lower grades as another, but I ended up hitting another error hehehe – Gabriel Rodrigues Aug 24 '22 at 20:13
  • I'm beginning to suspect that `1` means normal and `-1` means abnormal, which is why you're getting so many `1`s in the prediction – blackraven Aug 24 '22 at 22:47
  • I noticed that this works with outliers, yes. I looked at the other dataset that I need to compare the prediction against and saw that the "color" column was dropped; maybe this is influencing the algorithm in a bad way – Gabriel Rodrigues Aug 25 '22 at 00:39
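The recoding idea from the comments can be sketched as follows. It assumes, and this is only an assumption about what the exercise intends, that wines with quality 8 or 9 are the "anomalies". The two arrays are the 15 samples printed in the answer; recoding the true labels to `-1`/`1` lets them be compared with Isolation Forest output on the same scale.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# The 15 ground-truth samples and Isolation Forest predictions shown in the answer.
quality = np.array([9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5])
iforest_pred = np.array([1] * 15)

# Assumption: quality >= 8 is treated as the anomalous class (-1), the rest as normal (1).
y_binary = np.where(quality >= 8, -1, 1)

acc = accuracy_score(y_binary, iforest_pred)
print(classification_report(y_binary, iforest_pred, labels=[-1, 1], zero_division=0))
print("accuracy =", acc)
```

On these samples the model flags nothing as anomalous, so recall for the `-1` class is zero even though the overall accuracy is no longer 0.0. Isolation Forest also assumes anomalies are rare, which may not hold once SMOTE has balanced the classes.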