I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10. This is the data file: https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing
My goal is to apply the scikit-learn Gaussian NB model to the data, but in a binary classification task where only class 2 is the positive label and the remainder of the classes are all negatives. For that, I did the following code:
from sklearn.naive_bayes import GaussianNB, CategoricalNB
import pandas as pd
dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]
### train_test_split to split the dataframe into train and test sets
## with a partition of 20% for the test https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)
yc_TRAIN=np.array([int(i==2) for i in y_TRAIN])
mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)
# binarization of "y_true" array
yc_IVS=np.array([int(i==2) for i in y_IVS])
print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))
But I get the following warning message when calculating precision:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
The matthew's correlation coeficient func also outputs 0 and gives a runtimewarning: invalid value encountered in double_scalars
message.
Furthermore, by inspecting preds
, I got that the model predicts only negatives/zeros.
I've tried increasing the 20% test partition as some forums suggested but it didn't do anything.
Is this simply a problem of the model not being able to fit against the data or am I doing something wrong that may be inputting the wrong data format/type into the model?
Edit: yc_TRAIN
is the result of turning all cases from class 2 into my true positive cases "1" and the remaining classes into negatives/0, so it's a 1-d array of length 9450 (which matches my total number of prediction cases) with over 8697 0s and 753 1s, so its aspect would be something like this:
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]