
I'll try to describe what I have in mind.

There is text content stored in an MS SQL database. Content comes in daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: it's either "valid" or not.

What I want is to create a model based on the already validated content, save it, and use this model to "pre-validate" or mark new incoming content. I'd also like to update the model once in a while based on newly validated content. Hopefully I explained myself clearly.

I am thinking of using Spark Streaming to classify the data based on the created model, with the Naive Bayes algorithm. But how would you approach creating, updating, and storing the model? There are ~200K+ validated results (texts) of various lengths. Do I need that many for the model? And how do I use this model in Spark Streaming?

Thanks in advance.

Igor K.

1 Answer


Wow, this question is very broad and more related to Machine Learning than to Apache Spark; however, I will try to give you some hints or steps to follow (I won't do the work for you).

  1. Import all the libraries you need

    from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionModel
    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint
    import re
    
  2. Load your data into an RDD

    msgs = [("I love Star Wars but I can't watch it today", 1.0),
            ("I don't love Star Wars and people want to watch it today", 0.0),
            ("I dislike not being able to watch Star Wars", 1.0),
            ("People who love Star Wars are my friends", 1.0),
            ("I preffer to watch Star Wars on Netflix", 0.0),
            ("George Lucas shouldn't have sold the franchise", 1.0),
            ("Disney makes better movies than everyone else", 0.0)]
    
    rdd = sc.parallelize(msgs)
    
  3. Tokenize your data (if you use the DataFrame-based `ml` API it might be easier; see the sketch after these steps).

    rdd = rdd.map(lambda pair: ([w.lower() for w in re.split(" +", pair[0])], pair[1]))
    
  4. Remove all unnecessary words (widely known as stop-words) and symbols, e.g. `,`, `.`, `&`.

    commons = ["and", "but", "to"]
    # keep only the tokens that are not stop-words
    rdd = rdd.map(lambda pair: ([t for t in pair[0] if t not in commons], pair[1]))
    
  5. Create a dictionary with all the distinct words in your dataset. It sounds huge, but there are not as many as you would expect, and I bet they will fit in your master node (there are other ways to approach this, but for simplicity I will keep it this way).

    # collect every distinct word across the whole dataset
    words = rdd.flatMap(lambda pair: pair[0]).distinct().collect()
    diffwords = len(words)
    
  6. Convert your features into a DenseVector or a SparseVector. I would recommend the latter because a SparseVector normally requires less space to be represented; however, it depends on the data. Note that there are better alternatives like hashing (see the sketch after these steps), but I am trying to stay loyal to my verbose approach. After that, transform each tuple into a LabeledPoint.

    def sparsify(length, tokens):
        # map each distinct token to its index in the global dictionary
        indices = [words.index(t) for t in set(tokens)]
        # count how often each of those tokens occurs in this text
        quantities = [tokens.count(words[i]) for i in indices]

        return SparseVector(length, [(indices[i], quantities[i]) for i in range(len(indices))])

    rdd = rdd.map(lambda pair: LabeledPoint(pair[1], sparsify(diffwords, pair[0])))
    
  7. Fit your favorite model; in this case I used LogisticRegressionWithSGD due to ulterior motives.

    lrm = LogisticRegressionWithSGD.train(rdd)
    
  8. Save your model.

    lrm.save(sc, "mylovelymodel.model")
    
  9. Load your LogisticRegressionModel in another application.

    lrm = LogisticRegressionModel.load(sc, "mylovelymodel.model")
    
  10. Predict the categories.

    lrm.predict(SparseVector(37,[2,4,5,13,15,19,23,26,27,29],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
    # outputs 0
    
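As mentioned in step 3, the DataFrame-based `ml` API can make the tokenizing step easier. A minimal sketch, assuming a Spark 1.x `SQLContext` (the column names are arbitrary):

    from pyspark.sql import SQLContext
    from pyspark.ml.feature import Tokenizer

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame(msgs, ["text", "label"])
    # Tokenizer lowercases and splits on whitespace in one step
    tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
    tokens_df = tokenizer.transform(df)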

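Similarly, a minimal sketch of the hashing alternative mentioned in step 6, using MLlib's `HashingTF`; the feature-space size of 10,000 is an arbitrary choice, and it replaces the dictionary of steps 5 and 6 entirely:

    from pyspark.mllib.feature import HashingTF

    # starting from the (tokens, label) pairs of step 4, hash each token
    # list straight into a fixed-size SparseVector; no global dictionary needed
    htf = HashingTF(numFeatures=10000)
    hashed_rdd = rdd.map(lambda pair: LabeledPoint(pair[1], htf.transform(pair[0])))
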
Note that I didn't evaluate the accuracy of the model; however, it looks very accurate, doesn't it?
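
If you do want a quick check, here is a minimal sketch, assuming the `LabeledPoint` RDD from step 6 (the 70/30 split and seed are arbitrary, and this only makes sense on a realistically sized dataset, not on the seven toy messages):

    # hold out part of the labeled data to measure accuracy
    train, test = rdd.randomSplit([0.7, 0.3], seed=42)
    model = LogisticRegressionWithSGD.train(train)
    correct = test.map(lambda lp: 1 if model.predict(lp.features) == lp.label else 0)
    print("accuracy: %.3f" % (correct.sum() / float(test.count())))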

Alberto Bonsanto
  • Thank you, Alberto, for such a detailed answer. And just in time for the Star Wars release :). One question: do I always need 2 datasets? In my case, one validated by data validators and one omitted? Can't I just use one dataset, like just the sentences with Star Wars? – Igor K. Dec 18 '15 at 04:02
  • You just need a `training` dataset to train the model, but if you want to evaluate the model (accuracy and overfitting) you need 2 datasets (both labeled): one to train your model and the other to evaluate it; the second one is called `validation` data. After fitting the model you can predict the `label` for any other text after `sparsify`ing it. Nonetheless, I recommend the [Machine Learning by Stanford](https://www.coursera.org/learn/machine-learning/) course; don't worry if, like me, you can't afford the certificate: you can audit it and learn, because it's open to everyone. – Alberto Bonsanto Dec 18 '15 at 10:04
  • Hi Alberto. I built the model and now I have another question. How do I apply the model to incoming data once I load the model in another program? Let's say a text comes in: "I watched Star Wars today and it was great". How do I apply the model? I think I need to tokenize it and create vectors, but how do I label it? All the examples I can find test on labeled sets. Thanks in advance. – Igor K. Dec 20 '15 at 05:16
  • Don't forget that you are trying to `classify` new data based on old `labeled` data; therefore, you are trying to `predict` based on several `features`, and in text classification the features can be the words used in the text. Hence, if you are trying to classify new data, it obviously won't come labeled: the model will predict the label. That's how [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) works. I recommend you audit this course: [Scalable Machine Learning](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x) – Alberto Bonsanto Dec 20 '15 at 11:27
  • I think I need to transform the incoming data into a vector and pass it to the model's predict method. If I used the TF-IDF method to prepare the model, I need to prepare the vector the same way. Correct? – Igor K. Dec 20 '15 at 16:09
  • Yes, both datasets (training and validation) must have the same structure; otherwise the model will be useless. – Alberto Bonsanto Dec 20 '15 at 16:22
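
To make the comment thread above concrete: here is a minimal sketch of scoring unlabeled incoming text with the saved model in Spark Streaming, as the question asked. The socket source, host, and port are placeholders, and it assumes the `commons`, `words`, `diffwords`, and `sparsify` definitions from the answer; unseen words must be dropped, because they have no index in the training dictionary:

    from pyspark.streaming import StreamingContext
    from pyspark.mllib.classification import LogisticRegressionModel

    lrm = LogisticRegressionModel.load(sc, "mylovelymodel.model")
    vocabulary = set(words)  # for fast membership tests on the workers

    def preprocess(text):
        # mirror the training pipeline: tokenize, drop stop-words,
        # and drop words the model has never seen
        tokens = [w.lower() for w in re.split(" +", text)]
        tokens = [t for t in tokens if t not in commons and t in vocabulary]
        return sparsify(diffwords, tokens)

    ssc = StreamingContext(sc, 10)  # 10-second micro-batches
    texts = ssc.socketTextStream("localhost", 9999)  # placeholder source
    texts.map(lambda text: (text, lrm.predict(preprocess(text)))).pprint()

    ssc.start()
    ssc.awaitTermination()

For the "update the model once in a while" part of the question, MLlib also ships a `StreamingLogisticRegressionWithSGD` that can update its weights on a stream of labeled data, though a periodic batch retrain-and-save of the model above is often simpler.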