There is a stream of short texts. Each one is about the size of a tweet, so let us simply assume they are all tweets.

The user can vote on any tweet, so each tweet is in one of the following three states:

relevant (positive vote)

default (neutral i.e. no vote)

irrelevant (negative vote)

Whenever a new set of tweets arrives, they are displayed in a specific order, determined by the user's votes on all previous tweets. The aim is to assign a score to each new tweet, based on the word overlap between its text and the text of the tweets the user has already voted on. In other words, the highest-scoring tweet is the one that contains the most words from positively voted tweets and the fewest words from negatively voted tweets. New tweets with a high score are considered very relevant and also trigger a notification to the user.

One last thing: a minimum of semantic analysis (natural language processing) would be welcome.

I have read about Term Frequency–Inverse Document Frequency and came up with this very simple and basic solution:

Reminder: a word gets a high tf–idf weight when it is frequent in a given document and rare across the whole collection of documents.

If the user votes positive on a tweet, every word of that tweet receives a positive point (and likewise a negative point for a negative vote). We therefore end up with a large set of words, each with its total number of positive and negative points.

If tweet score > 0, the tweet triggers a notification.

tweet score = sum of the word scores of all words in the tweet

word score = word frequency * inverse total frequency

word frequency (over all previous votes) = (total positive votes for this word - total negative votes for this word) / total votes for this word

inverse total frequency = log(total votes of all words / total votes for this word)
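
The scoring scheme above can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: words are obtained by lowercasing and splitting on whitespace, each distinct word in a voted tweet gets one point per vote, and words that were never voted on score 0. The class name `VoteScorer` and its methods are hypothetical.

```python
import math
from collections import Counter


class VoteScorer:
    """Keeps per-word vote counts and scores new tweets (hypothetical helper)."""

    def __init__(self):
        self.pos = Counter()  # word -> number of positive points
        self.neg = Counter()  # word -> number of negative points

    def vote(self, text, positive):
        # Assumption: each distinct word of the tweet receives one point.
        target = self.pos if positive else self.neg
        for word in set(text.lower().split()):
            target[word] += 1

    def word_score(self, word):
        votes = self.pos[word] + self.neg[word]
        if votes == 0:
            return 0.0  # assumption: unseen words are neutral
        total = sum(self.pos.values()) + sum(self.neg.values())
        word_frequency = (self.pos[word] - self.neg[word]) / votes
        inverse_total_frequency = math.log(total / votes)
        return word_frequency * inverse_total_frequency

    def tweet_score(self, text):
        return sum(self.word_score(w) for w in text.lower().split())


scorer = VoteScorer()
scorer.vote("python machine learning", positive=True)
scorer.vote("celebrity gossip news", positive=False)

for tweet in ["learning python", "celebrity news"]:
    score = scorer.tweet_score(tweet)
    print(tweet, round(score, 3), "notify" if score > 0 else "no notification")
```

One property worth noticing: a word that accounts for all votes gets log(1) = 0, so it contributes nothing regardless of how it was voted.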

Is this method good enough? I am open to better methods and to any ready-made API or algorithm.

Hichem Acher

2 Answers

One possible solution would be to train a classifier such as Naive Bayes on the tweets that a user has voted on. You can take a look at the documentation of scikit-learn, a Python library, which explains how you can easily preprocess your text and train such a classifier.
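
As a rough sketch of what that could look like, the snippet below chains scikit-learn's `CountVectorizer`, `TfidfTransformer` and `MultinomialNB` with `make_pipeline`, trains on a handful of voted tweets, and ranks new tweets by the predicted probability of the "relevant" class. The example tweets, the label names, and the idea of ranking by `predict_proba` are my assumptions for illustration; neutral (unvoted) tweets are simply left out of training here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: tweets the user has voted on.
voted_tweets = [
    "new python release improves machine learning speed",
    "great tutorial on text classification in python",
    "celebrity gossip and red carpet photos",
    "reality show drama and celebrity news",
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

# Bag of words -> tf-idf weighting -> multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(voted_tweets, labels)

new_tweets = ["python text classification tips", "celebrity photos leaked"]

# Column of predict_proba that corresponds to the "relevant" class.
relevant_col = list(model.classes_).index("relevant")
p_relevant = model.predict_proba(new_tweets)[:, relevant_col]

# Show the most relevant tweets first.
ranked = sorted(zip(new_tweets, p_relevant), key=lambda pair: pair[1], reverse=True)
for tweet, p in ranked:
    print(f"{p:.2f}  {tweet}")
```

A notification threshold could then be a cut-off on this probability, chosen by experiment rather than fixed in advance.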

yvespeirsman
  • Thanks a lot for your feedback. I am gonna read about scikit-learn. Meanwhile, since you are an expert in this field, can you please tell me what you think of the tf-idf method as I described it? – Hichem Acher Apr 11 '15 at 10:07
  • Your intuitions about tf-idf are largely correct, but there are some issues with the method you're describing. For example, normalizing by document frequency should work better than your "inverse total frequency" (if I understand it correctly). However, your solution is fairly close to the Naive Bayes approach that you find in the documentation that I linked to above. A scikit-learn pipeline of `CountVectorizer`, `TfidfTransformer`, `MultinomialNB` should get you a working system fairly quickly, so I would stick to that. – yvespeirsman Apr 11 '15 at 10:34
  • I wanna run this classification feature on Google App Engine. However, I've found that it's not possible to run scikit-learn on GAE: http://stackoverflow.com/questions/22763165/is-it-possible-to-run-scikit-learn-on-google-app-engine Can you suggest any other tool that works with Google App Engine? – Hichem Acher Apr 14 '15 at 22:02

I would look at Naive Bayes, but for simple classification I would also look at the K-Nearest Neighbours algorithm. It is included in the scikit-learn library and is well documented.

Re: "running scikit-learn on GAE is not possible": you will need either to use the Google Prediction API or to run a VPS that acts as a worker for your classification tasks. The worker would obviously have to live on a separate system.

That said, if you only need simple classification on a suitably small dataset, you could implement a classifier yourself in JavaScript, like this one:

`http://jsfiddle.net/bkanber/hevFK/light/`

With a JS implementation, the processing time will become unacceptably slow if the dataset is too large, but it's nice to have as an option, even preferable in many cases.
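
A hand-written classifier of this kind needs no external library in other languages either. As an illustration (not a port of the fiddle linked above), here is a minimal multinomial Naive Bayes with add-one smoothing in plain Python. The class name `TinyNaiveBayes`, the whitespace tokenisation, and the choice to ignore words never seen in training are all assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Dependency-free multinomial Naive Bayes (hypothetical helper)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for word in text.lower().split():
                if word in self.vocab:  # assumption: skip unknown words
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


nb = TinyNaiveBayes()
nb.train("python machine learning tutorial", "relevant")
nb.train("celebrity gossip news", "irrelevant")
print(nb.classify("python tutorial"))
print(nb.classify("celebrity news"))
```

Every prediction loops over all labels and all words of the tweet, which is fine for one user's votes but is the kind of cost that grows with the dataset.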

Ultimately, GAE is not the platform I would use for building anything that may require more than the most basic ML techniques. I would look at Heroku, or at a VPS from a provider such as DigitalOcean or AWS.

torrange