
I'm doing a thesis on ML: I have to run 4 models (linear regression, robust regression, binary logistic regression and a naive Bayes classifier) on several CSV files.

I started by reading the official reference and trying things out. Since this is my first time working on ML, I have some doubts and questions for you:

  1. Should I fit and predict my model before running my K-Fold CV?

  2. I have to calculate these metrics: precision, specificity, recall, false omission rate, prevalence, accuracy, F2-score, MCC, informedness and markedness. Some of them are calculated automatically by the library; others, such as specificity, I need to calculate manually. I know these metrics can all be derived from the confusion matrix, which gives TP, TN, FP and FN. The problem is that I only know how to generate the confusion matrix after fitting and predicting the model, i.e. before evaluating my K-Fold CV, but I think I need to calculate these metrics based on my CV. How can I get the confusion matrix for every iteration of my K-Fold cross-validation? Then I could sum all the matrices, extract TP, TN, FP and FN, and calculate my preferred metrics (see the sketch after this list).

  3. I drop the first 2 columns from the CSV and then use the next 20 columns as input and the last one as output. Sometimes, but not every time, I get this error:

Traceback (most recent call last):
  File "/home/user/main.py", line 60, in <module>
    temp = np.delete(cm, i, 0)   # delete ith row
  File "<__array_function__ internals>", line 180, in delete
  File "/home/user/.local/lib/python3.10/site-packages/numpy/lib/function_base.py", line 5156, in delete
    raise IndexError(
IndexError: index 19 is out of bounds for axis 0 with size 19
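
For point 2 (and point 1), here is a minimal sketch of how a confusion matrix could be collected in every fold and summed, assuming the same X, Y and GaussianNB setup as in the code below; the labels=np.unique(Y) argument is my own addition to keep every fold's matrix the same shape even if a class never appears in a test split:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# assumes X (features) and Y (labels) are defined as in the code below
kf = KFold(n_splits=10, random_state=42, shuffle=True)
labels = np.unique(Y)                                   # fixed label order and size for every fold
total_cm = np.zeros((len(labels), len(labels)), dtype=int)

for train_idx, test_idx in kf.split(X):
    # fit a fresh model inside each fold, on that fold's training part only
    fold_model = GaussianNB()
    fold_model.fit(X.iloc[train_idx], Y.iloc[train_idx])
    fold_pred = fold_model.predict(X.iloc[test_idx])
    # per-fold confusion matrix, accumulated into one summed matrix
    total_cm += confusion_matrix(Y.iloc[test_idx], fold_pred, labels=labels)

print(total_cm)

Since the model is fitted from scratch inside each fold, there is no separate fit/predict step needed before the CV; cross_val_score does the same re-fitting internally.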

Code

import numpy as np
import time
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, multilabel_confusion_matrix
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.naive_bayes import GaussianNB

my_file = '/home/user/file.csv'
df = pd.read_csv(my_file)
del df['Name']      # drop the first two columns
del df['version']

X = df[['wmc','dit','noc','cbo','rfc','lcom','ca','ce','npm','lcom3','loc','dam','moa','mfa','cam','ic','cbm','amc','max_cc','avg_cc']]
Y = df['bug']

X_train, X_test, y_train, y_test = train_test_split(X,Y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)   # fit on the training split only, not on the full data
y_pred = gnb.predict(X_test)

kf = KFold(n_splits=10, random_state=42, shuffle=True)
# cross_val_score clones the estimator and re-fits it inside each fold
scores = cross_val_score(gnb, X, Y, scoring='accuracy', cv=kf, n_jobs=-1)

cm = confusion_matrix(y_test,y_pred)

# one-vs-rest counts for each class, taken from the confusion matrix
FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)

accuracy = (TP+TN)/(TP+TN+FP+FN)
print('Calculated accuracy: {}'.format(accuracy))
print(accuracy.mean())

TruePositive = np.diag(cm)

num_classes = 20   # hard-coded; cm has fewer rows when a class never appears in y_test or y_pred

TrueNegative = []
for i in range(num_classes):
    temp = np.delete(cm, i, 0)   # delete ith row
    temp = np.delete(temp, i, 1) # delete ith column
    TrueNegative.append(sum(sum(temp)))


FalsePositive = []
for i in range(num_classes):
    FalsePositive.append(sum(cm[:,i]) - cm[i,i])


FalseNegative = []
for i in range(num_classes):
    FalseNegative.append(sum(cm[i,:]) - cm[i,i])
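
Once TP, TN, FP and FN have been summed (over folds or over classes), the remaining metrics follow from their standard definitions; here is a sketch, reusing the TP, TN, FP and FN arrays computed above:

precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                      # sensitivity
specificity = TN / (TN + FP)
npv         = TN / (TN + FN)                      # negative predictive value
false_omission_rate = FN / (FN + TN)
prevalence  = (TP + FN) / (TP + TN + FP + FN)
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f2          = 5 * precision * recall / (4 * precision + recall)   # F-beta with beta = 2
mcc         = (TP * TN - FP * FN) / np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
informedness = recall + specificity - 1
markedness   = precision + npv - 1

In the multi-class case each of these is an array with one value per class; taking the mean over classes (macro-averaging) is one common way to report a single number.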