There is a stream of short texts, each about the size of a tweet; for simplicity, let us assume they are all tweets.
The user can vote on any tweet, so each tweet is in one of the following three states:
relevant (positive vote)
default (neutral i.e. no vote)
irrelevant (negative vote)
Whenever a new set of tweets arrives, it is displayed in a specific order, determined by the user's votes on all previous tweets. The aim is to assign a score to each new tweet, based on the word similarity (word matches) between its text and the text of all the tweets the user has previously voted on. In other words, the tweet with the highest score is the one that contains the most words from positively voted tweets and the fewest words from negatively voted tweets. New tweets with a high score will also trigger a notification to the user, since they are considered very relevant.
One last thing: a minimum of semantic consideration (natural language processing) would be great.
I have read about Term Frequency–Inverse Document Frequency (tf–idf) and came up with this very simple and basic solution:
Reminder: a word gets a high tf–idf weight when it occurs frequently in a given document and rarely in the collection as a whole.
If the user votes positive on a tweet, every word of that tweet receives a positive point (and likewise a negative point for a negative vote). This means that we end up with a large set of words, each with its total number of positive points and negative points.
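To make the bookkeeping concrete, here is a minimal sketch of how I imagine the per-word points being stored. The names `tokenize` and `record_vote`, the regex tokenizer, and the choice to give each distinct word one point per voted tweet (rather than one per occurrence) are my own assumptions, not a fixed part of the method.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def record_vote(text, vote, positive, negative):
    # Assumption: each distinct word gets one point per voted tweet,
    # however many times it occurs in that tweet.
    words = set(tokenize(text))
    if vote > 0:
        positive.update(words)
    elif vote < 0:
        negative.update(words)
    # vote == 0 is the default state and changes nothing.

positive, negative = Counter(), Counter()
record_vote("Python 3 is great", +1, positive, negative)
record_vote("Python is slow, slow, slow", -1, positive, negative)
print(positive["python"], negative["python"], negative["slow"])
```

With these two votes, "python" ends up with one positive and one negative point, and "slow" with a single negative point despite appearing three times.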
If (tweet score > 0) then this tweet will trigger a notification.
tweet score = sum of all individual words’ scores of this tweet
word score = word frequency * inverse total frequency
word frequency (over all previous votes) = (total positive votes for this word - total negative votes for this word) / total votes for this word
inverse total frequency = log(total votes of all words / total votes for this word)
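The formulas above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the sample counts are made up, "total votes" is taken to mean positive plus negative votes, and a word that has never been voted on is given a score of 0, because the formulas would otherwise divide by zero.

```python
import math

def word_score(pos, neg, total_all):
    total = pos + neg                      # total votes for this word
    if total == 0:
        return 0.0                         # assumption: unseen words are neutral
    frequency = (pos - neg) / total        # word frequency, in [-1, 1]
    inverse_total = math.log(total_all / total)
    return frequency * inverse_total

def tweet_score(words, positive, negative):
    # total votes of all words = every positive and negative point recorded so far
    total_all = sum(positive.values()) + sum(negative.values())
    return sum(
        word_score(positive.get(w, 0), negative.get(w, 0), total_all)
        for w in words
    )

# Hypothetical vote counts, for illustration only.
positive = {"python": 3, "release": 1}
negative = {"python": 1, "spam": 3}

for words in (["python", "release"], ["spam", "offer"]):
    score = tweet_score(words, positive, negative)
    print(words, round(score, 3), "notify" if score > 0 else "no notification")
```

Two properties of the formulas show up here: a word with as many positive as negative votes contributes exactly 0, and the log factor gives rarely voted words a larger weight than frequently voted ones.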
Is this method good enough? I am open to any better method and to any ready-made API or algorithm.