Edit: please share feedback in the comments, as I'm still learning how to post good questions.
I'm trying to train a model on this dataset with IsolationForest(). I then need to apply the trained model to another dataset with altered qualities, predict the quality values, and fetch all wines with quality 8 and 9.
However, I'm having problems with it: the accuracy score is 0.0.
Here is the classification report:
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0

    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0
I don't know whether this is a hyperparameter issue, whether I'm cleaning the data incorrectly, or whether I'm passing the wrong parameters. I have already tried both with and without SMOTE. I'd like to reach an accuracy of at least 90%.
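A possible explanation for the -1 and 1 rows in the report: IsolationForest is an unsupervised anomaly detector, so `fit` ignores the labels and `predict` only ever returns -1 (outlier) or 1 (inlier), never a quality value. The sketch below checks this on synthetic data; `X_demo` and `y_demo` are made-up stand-ins, not the wine dataset, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption: NOT the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = rng.integers(3, 10, size=500)  # quality-like labels in 3..9

iforest = IsolationForest(n_estimators=50, contamination=0.10, random_state=0)
iforest.fit(X_demo, y_demo)  # y is accepted for API consistency but ignored

pred = iforest.predict(X_demo)
predicted_labels = set(np.unique(pred).tolist())
true_labels = set(np.unique(y_demo).tolist())

print("predicted labels:", predicted_labels)
print("overlap with true labels:", predicted_labels & true_labels)
```

If the predicted label set never overlaps the true quality values, every row of the classification report will necessarily score 0.00, regardless of hyperparameters or SMOTE.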
I've made the shared drive link public so the dataset can be checked:
https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing
Here's my code:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix
# Load the data
df = pd.read_csv('wines.csv')
df.head(5)
# Encode the categorical 'color' column as integers
ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])
df.info()
df['color'] = df['color'].astype(int)
df.head(3)
# Oversample the minority quality classes with SMOTE
stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality', axis=1)
y_smote = df['quality']
x_smote, y_smote = stm.fit_resample(x_smote, y_smote)
print(x_smote.shape, y_smote.shape)
x_smote.columns
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote
X.shape, y.shape
# Hold out 30% of the resampled data for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.metrics import hamming_loss
iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10,
                          max_features=1.0, bootstrap=False, n_jobs=-1,
                          random_state=None, verbose=0, warm_start=False)
iforest_fit = iforest.fit(x_train, y_train)
prediction = iforest_fit.predict(x_test)
print(prediction.shape, y_test.shape)
y.value_counts()
prediction
print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)
print(classification_report(y_test, prediction))
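For comparison, here is a minimal sketch of what predicting quality with a supervised classifier could look like. It swaps in RandomForestClassifier, which is not what the code above uses, and runs on synthetic data from `make_classification` rather than wines.csv. The names `X_demo`, `y_demo`, `clf`, and `high_quality_rows` are hypothetical, and the hyperparameters are untuned, so this says nothing about whether 90% accuracy is reachable on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for wines.csv (assumption): 7 classes relabelled to 3..9
X_demo, y_raw = make_classification(
    n_samples=1400, n_features=12, n_informative=8,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
y_demo = y_raw + 3

# Split first, so anything fitted later only ever sees the training portion
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# A supervised classifier learns from y_train and predicts quality labels
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
print(classification_report(y_test, prediction))

# Keep only the rows predicted as quality 8 or 9
high_quality_mask = np.isin(prediction, [8, 9])
high_quality_rows = x_test[high_quality_mask]
print(high_quality_rows.shape)
```

If SMOTE or scaling is added to a pipeline like this, fitting them on the training split only (after `train_test_split`) keeps synthetic or scaled information from leaking into the test set.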