
I want to classify documents as sports, entertainment, or politics. I have created a bag of words which outputs something like:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that Naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that can be used by MLlib Naive Bayes?

zero323
decipher

2 Answers


For text classification, you need:

  • A word dictionary
  • Convert document into vector using the dictionary
  • Label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...

These steps are pretty straightforward.
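The three steps above can be sketched in plain Python (the documents, labels, and vocabulary below are made up for illustration; in Spark you would then wrap each pair in an `mllib` `LabeledPoint` and parallelize them into an RDD):

```python
# Step-by-step sketch: build a dictionary, vectorize, label.
# Documents and labels here are hypothetical examples.
docs = [
    ("the match ended in a draw", "sports"),
    ("the new film opens this week", "entertainment"),
    ("the senate passed the bill", "politics"),
]

# 1. Word dictionary: word -> index
vocab = sorted({w for text, _ in docs for w in text.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# 2. Convert a document into a count vector using the dictionary
def to_vector(text):
    vec = [0] * len(vocab)
    for w in text.split():
        vec[word_index[w]] += 1
    return vec

# 3. Label the document vectors: (label, vector) pairs
labeled = [(label, to_vector(text)) for text, label in docs]

# In Spark MLlib these pairs would become an RDD of LabeledPoint,
# roughly: sc.parallelize([LabeledPoint(label_ids[l], Vectors.dense(v))
#                          for l, v in labeled])
```

Here `label_ids` would be a mapping from label strings to numeric class ids, since MLlib's Naive Bayes expects numeric labels.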

David S.
  • Thanks for the answer, but I am a bit confused about the second part, "Convert document into vector using the dictionary": if I have labels like (music, politics, entertainment) and the top 50 words from each document, how do I map them to build a classifier? As far as the Spark example goes, they already have a text file which can be utilised as input for Naive Bayes. – decipher Jan 17 '16 at 11:56
  • A `doc vec` is just a long array, say 50 elements; the index of the array is the id of the word in your dict, and the value of the element is the *count* of the corresponding word. It is suggested you **normalize** the *count*. – David S. Jan 17 '16 at 13:27
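The comment above can be illustrated with a tiny sketch (a hypothetical 5-word dictionary; the normalization shown is simple L1 frequency normalization, one common choice):

```python
# Hypothetical dictionary: word -> array index
word_index = {"goal": 0, "match": 1, "film": 2, "vote": 3, "song": 4}

def doc_vec(words):
    # index = id of the word in the dict, value = count of that word
    vec = [0] * len(word_index)
    for w in words:
        if w in word_index:  # out-of-vocabulary words are skipped
            vec[word_index[w]] += 1
    return vec

def normalize(vec):
    # L1 normalization: raw counts -> relative frequencies summing to 1
    total = sum(vec)
    return [v / total for v in vec] if total else vec

counts = doc_vec(["goal", "goal", "match", "crowd"])  # -> [2, 1, 0, 0, 0]
freqs = normalize(counts)  # relative frequencies, summing to 1
```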
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover,
                                    CountVectorizer, StringIndexer)
    from pyspark.ml.classification import NaiveBayes

    # regular-expression tokenizer: split raw text on non-word characters
    regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words",
                                    pattern="\\W")
    # stop words to drop before counting
    add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]
    stopwordsRemover = StopWordsRemover(
        inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
    # bag-of-words counts
    countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                                   vocabSize=10000, minDF=5)
    # encode the string category as a numeric label
    labelIndexer = StringIndexer(inputCol="Category", outputCol="label")

    pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover,
                                countVectors, labelIndexer])
    # `data` is your input DataFrame with "Descript" and "Category" columns
    dataset = pipeline.fit(data).transform(data)

    (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
    nb = NaiveBayes(smoothing=1)
    model = nb.fit(trainingData)
    predictions = model.transform(testData)
    predictions.filter(predictions['prediction'] == 0) \
        .select("Descript", "Category", "probability", "label", "prediction") \
        .orderBy("probability", ascending=False) \
        .show(n=10, truncate=30)
  • Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). This provided answer may be correct, but it could benefit from an explanation. Code only answers are not considered "good" answers. From [review](https://stackoverflow.com/review). – Trenton McKinney Sep 26 '19 at 18:12