
This is a basic implementation of Gaussian Naive Bayes using sklearn. Can anyone tell me what I'm doing wrong here? My k-fold CV results are a bit weird:

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score, classification_report

column_names = ['AS', 'fh', 'class2']
df = pd.read_csv("C:/Users/Jans/Music/docx/222/test.csv", sep=';', header=0, names=column_names)

# Keep only 'fh' as the single feature; 'class2' is the target
x = df.drop(['AS', 'class2'], axis=1)
df['class2'] = df['class2'].astype(int)
y = df['class2'].values

# shuffle=False splits the rows sequentially, so random_state has no effect
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=False, random_state=None)

model = GaussianNB()
model.fit(x_train, y_train.astype('int'))

# 10-fold cross-validation on the training split
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
k_fold_mean = k_fold_acc.mean()
for i in k_fold_acc:
    print(i)
print("accuracy K Fold CV:" + str(k_fold_mean))

grid_predictions = model.predict(x_test)

My 10-fold CV results (the first fold in particular is very strange):

0.36714285714285716
0.8271428571428572
0.9785714285714285
0.9357142857142857
0.9628571428571429
0.9957081545064378
1.0
1.0
0.994277539341917
0.9842632331902719
accuracy K Fold CV:0.90456774984672

Also, when I increase my test set from, say, 0.2 to 0.6, these are the results, which is also a bit strange.

Am I doing something wrong? And if so, what?

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
accuracy K Fold CV:1.0
  • Please notice that ML questions that are not about *programming*, but about ML theory and/or methodology, are off-topic here; see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info for possible alternatives (the `machine-learning` tag was added, since this is actually a ML question not specific to scikit-learn, but the existence or not of the tag is immaterial to the question being actually off-topic). – desertnaut Aug 05 '23 at 20:01

2 Answers


Regarding the second problem: when you increase the test set size to 0.6, you reduce the size of the training set, which makes it easier for the model to memorize your training data (overfitting). I think what you're seeing is that the model has overfit, attaining perfect accuracy. To reduce its tendency to overfit, either give it more training data or regularise it more strongly, for example by introducing priors=.
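
A minimal sketch of that option (the prior values here are illustrative, not tuned; var_smoothing is a second built-in GaussianNB knob worth knowing about, not something the question used):

from sklearn.naive_bayes import GaussianNB

# Fixed class priors stop the model from estimating them on a small
# training set; the 50/50 split below is purely illustrative
model = GaussianNB(priors=[0.5, 0.5])

# Alternatively, raise var_smoothing (default 1e-9): it adds a fraction
# of the largest feature variance to every class variance, stabilising the fit
model = GaussianNB(var_smoothing=1e-6)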

Not sure about the first problem; it might just be sampling noise, with the first fold happening to be much harder. With a small dataset there is more sampling-related variability across the folds: in 10-fold CV each validation fold is only 10% of the training data, and if the training set is small to begin with, 10% of it is smaller still. Note also that you split with shuffle=False, so the folds come from sequentially ordered rows, and the first fold may simply be unrepresentative. Use a shuffled split with a fixed random_state to get repeatable results; that will let you dig deeper into fold 0 if needed.
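
For example, a sketch using the variable names from the question (the seed value 0 is arbitrary):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Shuffle with a fixed seed so the folds are randomised but repeatable
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), x_train, y_train, cv=cv)
print(scores)

# Pull out the rows that land in the first validation fold for inspection
train_idx, val_idx = next(cv.split(x_train, y_train))
print(x_train.iloc[val_idx].describe())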

some3128

There are two things here which may cause issues:

  1. The data you're developing the model with
  2. Leakage your code is introducing

From a data perspective, it looks like you're predicting the target variable with only one feature, 'fh'. If this feature is highly correlated with the target variable, then I would expect an unusually high accuracy. I haven't analysed your data, so I can't say whether this behaviour is unusual for the data you've fed into the model.
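
One quick sanity check, reusing the df from the question (I'm assuming 'fh' is numeric):

# How strongly does the single feature track the target?
print(df['fh'].corr(df['class2']))

# Per-class distribution of 'fh'; well-separated means would make the
# problem genuinely easy for a Gaussian model
print(df.groupby('class2')['fh'].describe())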

But even ignoring my concerns around the dataset itself, you also need to remove this line of code to avoid leakage:

model.fit(x_train, y_train.astype('int'))

Fitting the model before passing it to cross_val_score() will introduce leakage, since the model is being tested on data it has already seen. This might also explain why the accuracy of each fold is so high, although I'm not certain why the first fold is an outlier.
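
In other words, pass an unfitted estimator to cross_val_score() and fit only afterwards for the final test-set evaluation. A sketch following the question's variable names:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Pass an unfitted estimator; cross_val_score fits a fresh clone per fold
model = GaussianNB()
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
print("accuracy K Fold CV:", k_fold_acc.mean())

# Fit once on the full training split only for the held-out test set
model.fit(x_train, y_train)
grid_predictions = model.predict(x_test)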

When you increase test_size from 0.2 to 0.6, you also reduce the sample size of the training dataset. This probably exacerbates the leakage problem, which is why the accuracy increases further to 1.0 for each fold. But again, I can't say for certain without knowing anything about the data you're using.

  • "Fitting the model before passing it to cross_val_score() will introduce leakage" is incorrect. cross_val_score will refit clones of the model on each fold: they have not actually seen the full dataset. – Ben Reiniger Aug 05 '23 at 15:15