I have a dataset which includes 200000 labelled training examples.
For each training example I have 10 features, including both continuous and discrete.
I'm trying to use sklearn
package of python in order to train the model and make predictions but I have some troubles (and some questions too).
First let me write the code which I have written so far:
from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)
The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn
or should I fit
the model using the training dataset and then call predict
using the validation set?
Any thoughts or suggestions will be much appreciated.