sklearn: Naive Bayes classifier gives low accuracy

Question

I have a dataset of 200,000 labelled training examples. Each example has 10 features, both continuous and discrete. I'm trying to use Python's sklearn package to train a model and make predictions, but I'm running into some trouble (and have some questions too).

First, here is the code I have written so far:

from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)

The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.

Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?

Any thoughts or suggestions will be much appreciated.

Train/test splitting is *not* done automatically, but there are many built-in features that let you do this easily. — juanpa.arrivillaga, Nov 10 '16 at 19:56
Take a look at sklearn's functions for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html) — jkr, Nov 10 '16 at 19:58
On the other hand, you are fitting the model to all of your data, so one would expect relatively high accuracy when predicting on that same data. You might want to look into tuning the hyperparameters of your model (see [`sklearn`'s page on parameter tuning](http://scikit-learn.org/stable/modules/grid_search.html)). — jkr, Nov 10 '16 at 20:06

score 6 · Answer 1 · answered Nov 10 '16 at 20:40

> The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.

This is not a big error for Naive Bayes. It is an extremely simple classifier and you should not expect it to be strong; more data probably won't help. Your Gaussian estimators are probably already very good, and the naive (feature independence) assumptions are the problem. Use a stronger model. You can start with a Random Forest, since it is very easy to use even for non-experts in the field.
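As a rough illustration of that suggestion, here is a minimal sketch that compares `GaussianNB` with a `RandomForestClassifier` on held-out data. The dataset is synthetic (generated with `make_classification`) and only stands in for the asker's data, so the sizes, the number of classes and the forest settings are all assumptions, and the numbers it prints say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data: 10 features, 4 classes (assumed values).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_redundant=2,
    n_classes=4, n_clusters_per_class=2, random_state=0,
)

# Hold out a quarter of the examples so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the held-out set
    print(f"{name}: held-out accuracy = {scores[name]:.3f}")
```

Swapping the synthetic `X` and `y` for the real `data` and `targets` should be enough to run the same comparison on the actual dataset.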

> Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?

No, it is not. You should use different distributions for the discrete features; however, scikit-learn does not support mixing distributions within a single Naive Bayes model, so you would have to do this manually. As said before: change your model.
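One way to do this manually, sketched here under several assumptions, is to fit one Naive Bayes model per feature type and combine their outputs. Because the features are assumed independent given the class, the two log-posteriors can be added, and the log class prior subtracted once so it is not counted twice. The sketch uses `CategoricalNB`, which exists in newer scikit-learn releases, and it assumes the discrete columns are already encoded as integers starting at 0. The data is synthetic and the helper `predict_mixed` is a hypothetical name, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes, n_categories = 3000, 3, 4  # assumed sizes

# Synthetic data: two continuous and two discrete columns, both tied to the class.
y = rng.integers(0, n_classes, size=n_samples)
X_cont = rng.normal(loc=y[:, None] * 1.0, scale=1.5, size=(n_samples, 2))
X_disc = (y[:, None] + rng.integers(0, 3, size=(n_samples, 2))) % n_categories

Xc_tr, Xc_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
    X_cont, X_disc, y, test_size=0.25, random_state=0
)

# One model per feature type: Gaussian for continuous, categorical for discrete.
gnb = GaussianNB().fit(Xc_tr, y_tr)
cnb = CategoricalNB(min_categories=n_categories).fit(Xd_tr, y_tr)


def predict_mixed(Xc, Xd):
    """Combine both models: sum the log-posteriors, remove one copy of the log prior."""
    log_post = (
        gnb.predict_log_proba(Xc)
        + cnb.predict_log_proba(Xd)
        - np.log(gnb.class_prior_)  # the prior appears in both posteriors
    )
    return gnb.classes_[np.argmax(log_post, axis=1)]


pred = predict_mixed(Xc_te, Xd_te)
mixed_acc = float(np.mean(pred == y_te))
print(f"mixed Naive Bayes held-out accuracy = {mixed_acc:.3f}")
```

The combined scores are unnormalised, which is fine for picking the most likely class; normalising them would be needed only if calibrated probabilities were required.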

> Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?

No, nothing like this is done automatically; you need to do it on your own (scikit-learn has lots of tools for that, see the cross-validation packages).
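A minimal sketch of both options, a single hold-out split and k-fold cross-validation, might look like the following. The names `data` and `targets` mirror the question, but here they are filled with synthetic values from `make_classification`; the 80/20 split and `cv=5` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic placeholder for the asker's `data` and `targets`.
data, targets = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Option 1: fit on the training part, evaluate on the held-out validation part.
X_train, X_val, y_train, y_val = train_test_split(
    data, targets, test_size=0.2, stratify=targets, random_state=0
)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
val_accuracy = gnb.score(X_val, y_val)
print(f"validation accuracy = {val_accuracy:.3f}")

# Option 2: 5-fold cross-validation, which refits the model once per fold.
cv_scores = cross_val_score(GaussianNB(), data, targets, cv=5)
print(f"cross-validated accuracy = {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Either way, the accuracy is measured on examples the model did not see during fitting, unlike calling `predict` on the training data itself.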

I want to test multiple models in order to make some predictions using various algorithms and produce a report. The 20% I mentioned above is the accuracy, not the misclassified predictions. By the way, you mentioned that I need different distributions for the discrete features. Could you please tell me how I can do that (even manually)? — Giorgos Myrianthous, Nov 10 '16 at 20:58
This is still possible with Naive Bayes unfortunately. How many classes do you have there? — lejlot, Nov 10 '16 at 20:59
