I used Naive Bayes from Spark's MLlib to train a model and test it on my data (in the form of an RDD). The results were confusing.
The data and results are as follows:
This is a binary classification problem: each example is labelled either '0' or '1'.
Total number of '0' labels in the test dataset: 11774
Total number of '1' labels in the test dataset: 246
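As a quick sanity check on the class balance, here is a minimal sketch in plain Python (no Spark required) that works out the share of '1' labels and the accuracy a model would get by always predicting '0'. The variable names are illustrative only; the two counts are the ones reported above.

```python
# Class counts in the test set, as reported above.
n_zero = 11774
n_one = 246
total = n_zero + n_one

# Share of the minority class and accuracy of a constant "always 0" predictor.
share_one = n_one / float(total)
baseline_accuracy = n_zero / float(total)

print("test examples: %d" % total)                                    # 12020
print("share of '1' labels: %.4f" % share_one)                        # 0.0205
print("accuracy of always predicting '0': %.4f" % baseline_accuracy)  # 0.9795
```

So a model that ignores the features entirely would already score about 98% accuracy on this split.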
Code for reference:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.evaluation import MulticlassMetrics

def parsePoint(line):
    # The last value in each record is the label; the rest are features.
    values = [float(x) for x in line]
    return LabeledPoint(values[-1], values[0:-1])

data = myRDD.map(parsePoint)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train the model (this call uses LogisticRegressionWithLBFGS).
model = LogisticRegressionWithLBFGS.train(training, 1.0)

#labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))

# Fraction of test examples whose prediction equals the label.
# Indexing the tuple avoids the Python 2-only `lambda (v, p): ...` syntax.
accuracy = 1.0 * predictionAndLabels.filter(lambda vp: vp[0] == vp[1]).count() / test.count()
print(accuracy)
After applying the model and obtaining the predictions (taking '1' as the positive class):
True Positives - 0
False Positives - 0
False Negatives - 246
True Negatives - 11774
All of my '0' labels are correctly classified, whereas all of the '1' labels are incorrectly classified!
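To make the outcome concrete, the sketch below tallies confusion-matrix counts from (prediction, label) pairs, taking '1' as the positive class. It is plain Python, and `confusion_counts` is a hypothetical helper written for this illustration, not an MLlib function. The pairs are reconstructed from the counts reported above; in the actual job they could presumably come from something like `predictionAndLabels.collect()`.

```python
def confusion_counts(pairs, positive=1.0):
    """Count TP, FP, FN, TN from an iterable of (prediction, label) pairs."""
    tp = fp = fn = tn = 0
    for pred, label in pairs:
        if pred == positive and label == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif label == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Reconstructed from the reported results: every test example is predicted 0.0.
pairs = [(0.0, 0.0)] * 11774 + [(0.0, 1.0)] * 246

counts = confusion_counts(pairs)
accuracy = (counts["TP"] + counts["TN"]) / float(len(pairs))
recall = counts["TP"] / float(counts["TP"] + counts["FN"])

print(counts)
print("accuracy: %.4f, recall on '1': %.4f" % (accuracy, recall))
```

Accuracy comes out near 0.98 while recall on the '1' class is 0, which is why accuracy alone makes this result look better than it is.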
This is part of my project, and I'm not sure whether these results are acceptable to submit.
The code I wrote using Spark's Python API reads the data from a file and builds the RDD. I then fed this RDD into the Naive Bayes example from the Spark MLlib documentation on the website, and the result is as above.
Can someone please tell me if this result is normal?