
I used Naive Bayes from Spark's MLlib to train a model and test it on the data (in the form of an RDD). The results were confusing.

The data and results are as follows:

The problem is a binary classification one. The outcome should be a label of either '0' or '1'.

total number of labels with '0' in the testing dataset - 11774

total number of labels with '1' in the testing dataset - 246
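
(A count like this can be read off the RDD of LabeledPoints directly; a minimal sketch, assuming the test split built in the code below:)

# Minimal sketch: check the class distribution of the test split (defined in the code below)
label_counts = test.map(lambda lp: lp.label).countByValue()
print(label_counts)  # e.g. {0.0: 11774, 1.0: 246}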

Code for reference:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
from pyspark.mllib.evaluation import MulticlassMetrics

def parsePoint(line):
    values = [float(x) for x in line]
    return LabeledPoint(values[-1], values[0:-1])

data = myRDD.map(parsePoint)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = LogisticRegressionWithLBFGS.train(training, 1.0)

#labelsAndPreds = test.map(lambda p: (p.label, model.predict(p.features)))
predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))
accuracy = 1.0 * predictionAndLabels.filter(lambda vp: vp[0] == vp[1]).count() / test.count()
accuracy

After applying the model and obtaining the predictions:

True Positives - 11774

False Positives - 0

False Negatives - 246

True Negatives - 0

All my '0' labels are correctly classified, whereas all the '1' labels are incorrectly classified!
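
For reference, MulticlassMetrics (imported above but unused) can compute the confusion matrix and per-class precision/recall directly from predictionAndLabels; a minimal sketch:

from pyspark.mllib.evaluation import MulticlassMetrics

# Build the metrics from the (prediction, label) pairs computed above
metrics = MulticlassMetrics(predictionAndLabels)

# Rows are true labels, columns are predicted labels, ordered by ascending label
print(metrics.confusionMatrix().toArray())

# Per-class precision and recall for label '1'
print(metrics.precision(1.0), metrics.recall(1.0))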

Now, this is part of my project and I'm not sure whether these results are fine to submit.

The code I wrote using Spark's Python API reads the data from a file and builds the RDD. I then fed this RDD into the Naive Bayes example from Spark MLlib's documentation on the website, and the result is as above.

Can someone please tell me if this result is normal?

preetham madeti
  • IMHO this belongs on stats.stackexchange.com but it is more or less something you may expect if the distribution of classes is highly skewed. – zero323 Apr 30 '16 at 12:45
  • The data I used is from the 'posts.small.xml' file, derived from the 'Posts.xml' file in the SO/Stack Exchange archives. Suppose we say that the distribution of classes is highly skewed; would it be a problem with the code or with the data? Please give me your input. Thank you! – preetham madeti Apr 30 '16 at 13:09
  • (Probably) Data (feature extraction) and the specific model. "Zeros" account for roughly 98% of your data, right? So unless there is very strong evidence that something is "one", the right answer is "zero". This is pretty much what NB is doing. – zero323 Apr 30 '16 at 13:42
  • @zero323 is right. Another way to think about it is, when a positive label is rare, the easiest way to get a high accuracy is to simply classify everything as negative. One way you can prevent this is to downsample the number of negative cases, increasing the ratio of positive:negative labels in your training data. Try googling "imbalanced class labels" or "skewed label distributions" for more approaches. – Galen Long Apr 30 '16 at 18:01
  • It clearly means that your classifier is predicting everything to be a positive class – Failed Scientist Dec 04 '18 at 05:09
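
Following the downsampling suggestion in the comments, a minimal sketch of rebalancing the training split before fitting the model; the sampling fraction of 0.1 is only an illustrative value:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Split the training data by class
positives = training.filter(lambda lp: lp.label == 1.0)
negatives = training.filter(lambda lp: lp.label == 0.0)

# Keep only a fraction of the majority ('0') class to reduce the imbalance
sampled_negatives = negatives.sample(withReplacement=False, fraction=0.1, seed=0)

# Retrain on the rebalanced data
balanced_training = positives.union(sampled_negatives)
model = LogisticRegressionWithLBFGS.train(balanced_training)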

0 Answers