
I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10. This is the data file: https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing

My goal is to apply scikit-learn's Gaussian Naive Bayes model (GaussianNB) to the data, but as a binary classification task where only class 2 is the positive label and all remaining classes are negatives. For that, I wrote the following code:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, matthews_corrcoef
import numpy as np
import pandas as pd

dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]

# train_test_split to split the data into train and test sets,
# holding out 20% for testing:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)

# binarization of the training labels: class 2 becomes the positive
# label (1), every other class becomes negative (0)
yc_TRAIN = np.array([int(i == 2) for i in y_TRAIN])

mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)

# binarization of the "y_true" array in the same way
yc_IVS = np.array([int(i == 2) for i in y_IVS])

print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))

But I get the following warning message when calculating precision:

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.

The matthews_corrcoef function also outputs 0 and raises a RuntimeWarning: invalid value encountered in double_scalars.

Furthermore, by inspecting preds, I found that the model predicts only negatives/zeros.
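For example, a quick check along these lines (with preds being the array returned by mdl.predict above) confirms it:

import numpy as np

# count how many times each class appears in the predictions
values, counts = np.unique(preds, return_counts=True)
print(dict(zip(values, counts)))  # only the key 0 shows up: no positives predicted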

I've tried increasing the 20% test partition, as some forums suggested, but it didn't change anything.

Is this simply a problem of the model not being able to fit the data, or am I doing something wrong, perhaps feeding the model the wrong data format/type?

Edit: yc_TRAIN is the result of turning all class-2 cases into my true positive cases ("1") and all remaining classes into negatives ("0"). It is a 1-d array of length 9450 (which matches my total number of training cases), with 8697 0s and 753 1s, so it looks something like this:

[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ] 
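(As a side note, the same binarization can be written as a single vectorized step; this sketch is equivalent to the list comprehensions above:)

# vectorized equivalent of the list comprehension:
# 1 where the original label is 2, 0 everywhere else
yc_TRAIN = (y_TRAIN == 2).astype(int)
yc_IVS = (y_IVS == 2).astype(int)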
  • What does `yc_TRAIN.sum()` give you? – Ari Cooper-Davis Mar 20 '22 at 09:05
  • What does `np.array([int(i==2) for i in y_TRAIN])` do? Please post a sample of your `yc_TRAIN` (i.e. as they are fed into the model). – desertnaut Mar 20 '22 at 09:47
  • @desertnaut I've edited the post to contain its description + sample. yc_TRAIN is the resulting array after I transformed all my y_true classes into either 1s or 0s to fit the binary classification task. In this case, out of the initial 10 classes I only turned class "2" into my positive label/1 and the remainder were all turned to 0. – Piers Mar 20 '22 at 12:39
  • @AriCooper-Davis it gives me 753, my number of positive cases – Piers Mar 20 '22 at 12:43
  • 753 positive out of how many in total? – desertnaut Mar 20 '22 at 12:45
  • @desertnaut 753 out of 9450 total cases. Do you think these results/warnings could be due solely to having an unbalanced dataset? – Piers Mar 20 '22 at 12:59
  • They are indeed due to the imbalance. – desertnaut Mar 20 '22 at 13:01

1 Answer


Your code looks fine; this is a classic problem with imbalanced datasets, and it actually means you do not have enough training data to correctly classify the rare positive class. Since your model never predicts the positive class, precision, TP / (TP + FP), has a zero denominator, and so does the Matthews correlation coefficient, which is exactly what the two warnings report.

The only thing you could improve in the given code is to set stratify=y_d in train_test_split, in order to get a stratified training set; decreasing the size of the test set (i.e. leaving more samples for training) may also help:

X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=y_d)

If this does not work, you should start thinking of applying class imbalance techniques (or different models); but this is no longer a programming question, it is a theory/methodology one, and it should be addressed at the appropriate SE sites rather than here (see the intro and NOTE in the machine-learning tag info).
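Purely as a pointer, one such technique is random oversampling of the minority class in the training set; here is a minimal sketch with sklearn.utils.resample, reusing the variable names from the question (whether it actually helps depends on the data):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

# separate the minority (class 1) and majority (class 0) training rows
pos_mask = yc_TRAIN == 1
X_pos, X_neg = X_TRAIN[pos_mask], X_TRAIN[~pos_mask]

# oversample the positive rows (with replacement) up to the majority count
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=23)

# rebuild a balanced training set and refit
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.concatenate([np.zeros(len(X_neg), dtype=int),
                        np.ones(len(X_pos_up), dtype=int)])

mdl = GaussianNB().fit(X_bal, y_bal)

Alternatively, GaussianNB itself exposes a priors argument (e.g. GaussianNB(priors=[0.5, 0.5])), which overrides the skewed class priors that would otherwise be estimated from the data.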
