
I want to classify documents as sports, entertainment, or politics. I have created a bag of words which outputs something like:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that Naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that can be used by MLlib Naive Bayes?

zero323
decipher

2 Answers


For text classification, you need:

  • A word dictionary
  • Convert document into vector using the dictionary
  • Label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...

These steps are pretty straightforward.
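The three steps above can be sketched in plain Python (the documents, labels, and vocabulary below are made up for illustration; in Spark you would then wrap each pair in an `mllib` `LabeledPoint` and parallelize them into an RDD):

```python
# Step-by-step sketch: build a dictionary, vectorize, label.
# Documents and labels here are hypothetical examples.
docs = [
    ("the match ended in a draw", "sports"),
    ("the new film opens this week", "entertainment"),
    ("the senate passed the bill", "politics"),
]

# 1. Word dictionary: word -> index
vocab = sorted({w for text, _ in docs for w in text.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# 2. Convert a document into a count vector using the dictionary
def to_vector(text):
    vec = [0] * len(vocab)
    for w in text.split():
        vec[word_index[w]] += 1
    return vec

# 3. Label the document vectors: (label, vector) pairs
labeled = [(label, to_vector(text)) for text, label in docs]

# In Spark MLlib these pairs would become an RDD of LabeledPoint,
# roughly: sc.parallelize([LabeledPoint(label_ids[l], Vectors.dense(v))
#                          for l, v in labeled])
```

Here `label_ids` would be a mapping from label strings to numeric class ids, since MLlib's Naive Bayes expects numeric labels.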

David S.
  • Thanks for the answer, but I am a bit confused about the second part, "Convert document into vector using the dictionary": if I have labels like (music, politics, entertainment) and the top 50 words from each document, how do I map them to build a classifier? As far as the Spark example goes, they already have a text file which can be utilised as input for Naive Bayes. – decipher Jan 17 '16 at 11:56
  • A `doc vec` is just a long array, say 50 elements; the index of the array is the id of the word in your dict, and the value of the element is the *count* of the corresponding word. It is suggested you **normalize** the *count*. – David S. Jan 17 '16 at 13:27
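The comment above can be illustrated with a tiny sketch (a hypothetical 5-word dictionary; the normalization shown is simple L1 frequency normalization, one common choice):

```python
# Hypothetical dictionary: word -> array index
word_index = {"goal": 0, "match": 1, "film": 2, "vote": 3, "song": 4}

def doc_vec(words):
    # index = id of the word in the dict, value = count of that word
    vec = [0] * len(word_index)
    for w in words:
        if w in word_index:  # out-of-vocabulary words are skipped
            vec[word_index[w]] += 1
    return vec

def normalize(vec):
    # L1 normalization: raw counts -> relative frequencies summing to 1
    total = sum(vec)
    return [v / total for v in vec] if total else vec

counts = doc_vec(["goal", "goal", "match", "crowd"])  # -> [2, 1, 0, 0, 0]
freqs = normalize(counts)  # relative frequencies, summing to 1
```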
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover,
                                    CountVectorizer, StringIndexer)
    from pyspark.ml.classification import NaiveBayes

    # regular-expression tokenizer: split raw text on non-word characters
    regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words",
                                    pattern="\\W")
    # stop words to drop before counting
    add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]
    stopwordsRemover = StopWordsRemover(
        inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
    # bag-of-words counts
    countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                                   vocabSize=10000, minDF=5)
    # encode the string category as a numeric label
    labelIndexer = StringIndexer(inputCol="Category", outputCol="label")

    pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover,
                                countVectors, labelIndexer])
    # `data` is your input DataFrame with "Descript" and "Category" columns
    dataset = pipeline.fit(data).transform(data)

    (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
    nb = NaiveBayes(smoothing=1)
    model = nb.fit(trainingData)
    predictions = model.transform(testData)
    predictions.filter(predictions['prediction'] == 0) \
        .select("Descript", "Category", "probability", "label", "prediction") \
        .orderBy("probability", ascending=False) \
        .show(n=10, truncate=30)
  • Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). This provided answer may be correct, but it could benefit from an explanation. Code only answers are not considered "good" answers. From [review](https://stackoverflow.com/review). – Trenton McKinney Sep 26 '19 at 18:12