
I am trying to build a text classification model using the Naive Bayes algorithm.

Here's my sample data (label and feature):

1|combusting [chemical]
1|industrial purposes
1|
2|salt for preserving, 
2|other for foodstuffs
2|auxiliary 
2|fluids for use with abrasives
3|vulcanisation 
3|accelerators
3|anti-frothing solutions for batteries
4|anti-frothing solutions for accumulators
4|acetates 
4|[chemicals]*
4|acetate of cellulose, unprocessed

Following is my sample code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF

val rawData = sc.textFile("~/data.csv")

val rawData1 = rawData.map(x => x.replaceAll(",","")) 

val htf = new HashingTF(1000) 

val parsedData = rawData1.map { line =>
  val values = line.split("|").toSeq
  val featureVector = htf.transform(values(1).split(" "))
  val label = values(0).toDouble
  LabeledPoint(label, featureVector)
}

val splits = parsedData.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 2.0, modelType = "multinomial")

val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach( l => println(metrics.fMeasure(l)))

val testData1 = htf.transform("salt")
val predictionAndLabels1 = model.predict(testData1)

I am getting approximately 33% accuracy (very low), and the test data is predicted with the wrong label. I have printed parsedData, which contains the label and features as below:

(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))

I am not able to figure out what is missing; the hashing term-frequency function seems to be generating repeated term frequencies. Please suggest how I can improve the model's performance. Thanks in advance.

Masoud
  • Read this post: [Text Classification](http://stackoverflow.com/questions/34345189/text-classification-how-to-approach?answertab=active#tab-top). I think everything you did was good except for the hashing part. – Alberto Bonsanto Jan 15 '16 at 14:25

1 Answer


You have to ask yourself several questions before you start implementing your algorithm:

  • Your texts look very short. What is the size of your vocabulary? Answering this will help you tune the dimensionality of the HashingTF; in your case, you might need a lower value. One way to estimate it is shown in the sketch after this list.
  • You might need to consider doing some pre-processing on your texts, e.g. using a StopWordsRemover, stemming, or a Tokenizer.
  • A tokenizer will produce cleaner terms than the ad-hoc string processing you are doing; see the sketch after this list.
  • Tune your parameters, namely the numFeatures of the HashingTF and the lambda of the Naive Bayes.
  • Basically, in machine learning you will need to do cross-validation over a set of parameters in order to optimise your results. Check this example and try to do something similar by tuning your HashingTF and the lambda as follows:
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 0.01)) // lambda is exposed as "smoothing" in the ml API
  .build()
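
For the pre-processing and vocabulary-size points above, here is a minimal sketch using the spark.ml RegexTokenizer and StopWordsRemover. The DataFrame df and the column names "text", "words", and "filteredWords" are assumptions made for this illustration, not names from the question:

import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

// Hypothetical DataFrame `df` with a string column "text".
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+") // split on runs of non-word characters

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filteredWords")

val cleaned = remover.transform(tokenizer.transform(df))

// Rough vocabulary size, useful for choosing HashingTF's numFeatures.
val vocabSize = cleaned.select("filteredWords")
  .rdd.flatMap(_.getSeq[String](0)).distinct().count()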

In general, using Pipelines and CrossValidator works well with Naive Bayes for multi-class classification, so have a look at that approach (sketched below) rather than hard-coding all the steps by hand.
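
To put it together, here is a minimal sketch of wiring the grid above into a Pipeline and a CrossValidator. The DataFrame trainingDF with "label" and "text" columns is an assumption for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hypothetical DataFrame `trainingDF` with "label" (Double) and "text" (String) columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val naiveBayes = new NaiveBayes()

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, naiveBayes))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 0.01))
  .build()

// 3-fold cross-validation over the grid, scored with the default multi-class F1 metric.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)

The CrossValidator then selects the best (numFeatures, smoothing) combination automatically, which replaces the hand-picked HashingTF(1000) and lambda = 2.0 in the question.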

Rami
  • Thanks for your reply. The training text is more than 1000 rows; that's why I used HashingTF(1000). I will try the Tokenizer and StopWordsRemover methods to pre-process the text. – Jan 17 '16 at 07:33