I am trying to build a text classification model using the Naive Bayes algorithm in Spark MLlib.
Here's my sample data (label and feature):
1|combusting [chemical]
1|industrial purposes
1|
2|salt for preserving,
2|other for foodstuffs
2|auxiliary
2|fluids for use with abrasives
3|vulcanisation
3|accelerators
3|anti-frothing solutions for batteries
4|anti-frothing solutions for accumulators
4|acetates
4|[chemicals]*
4|acetate of cellulose, unprocessed
Here is my sample code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
val rawData = sc.textFile("~/data.csv")
val rawData1 = rawData.map(x => x.replaceAll(",",""))
val htf = new HashingTF(1000)
val parsedData = rawData1.map { line =>
  val values = (line.split("|").toSeq)
  val featureVector = htf.transform(values(1).split(" "))
  val label = values(0).toDouble
  LabeledPoint(label, featureVector)
}
val splits = parsedData.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 2.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
val testData1 = htf.transform("salt")
val predictionAndLabels1 = model.predict(testData1)
I am getting only about 33% accuracy, which is very low, and the prediction for the test input has the wrong label. I printed parsedData, which contains the label and feature vector for each record:
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
I cannot work out what is missing. The hashing term-frequency transform seems to produce the same term-frequency vectors over and over. How can I improve the model's performance? Thanks in advance.
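
One thing that may be worth checking is the split step, since String.split takes a regular expression and "|" is the regex alternation operator. An unescaped "|" matches the empty string, so the line is cut between every character rather than at the pipe, which would explain why each record ends up with a single hashed feature. The sketch below uses plain Scala with no Spark dependency and compares the two behaviours. SplitCheck and parseLine are hypothetical names used only for illustration, and the lower-casing and whitespace tokenisation are assumptions, not part of the code above.

```scala
object SplitCheck {
  // Hypothetical helper (not from the post): parse one "label|text" record.
  // "\\|" escapes the pipe so it is matched literally; the limit of 2 keeps
  // an empty text field for lines such as "1|" instead of dropping it.
  def parseLine(line: String): (Double, Seq[String]) = {
    val parts = line.split("\\|", 2)
    val label = parts(0).trim.toDouble
    val text = if (parts.length > 1) parts(1) else ""
    // Assumed tokenisation: lower-case, split on whitespace, drop empty tokens.
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
    (label, tokens)
  }

  def main(args: Array[String]): Unit = {
    val line = "2|salt for preserving"
    // Unescaped pipe: the string is split into single characters.
    println(line.split("|").take(4).mkString("[", ", ", "]"))
    // Escaped pipe: one label and the list of words.
    println(parseLine(line))
  }
}
```

If this turns out to be the cause, the token sequence returned by parseLine could be passed to htf.transform in place of values(1).split(" "). With only a handful of rows per class, accuracy may still be limited by the amount of data.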