I have a rather specific question, or at least it seems so to me: after quite a lot of searching I couldn't find anything useful. As the title says, I am looking for an algorithm that determines whether two articles given as input "match", not in the sense of ordinary string matching, but in the sense that they talk about the same subject. I expect that the "match" would be computed using some kind of weights and compared against a threshold. The concept is therefore fuzzy: we can't talk about a complete "match", only about a degree of "match".

Sadly, I don't have anything more. I would be really grateful if any of you could help me with this topic; theoretical ideas are also welcome.

Thank you.

NiVeR
  • Seems like you are describing a text classification problem. In your case, the input is (text1, text2) and the output should be true or false depending on whether text1 and text2 are "similar". This is solvable using standard classification solutions, supervised or unsupervised. Which do you prefer? – amit Feb 13 '14 at 14:17
  • You're after some kind of artificial intelligence that can read an article and determine the point of view of the author? Good luck with that one. It's hard enough for a human! – ᴇʟᴇvᴀтᴇ Feb 13 '14 at 14:17
  • If there is a collection of tags associated with every article you can simply use the ratio of corresponding tags – BlackBear Feb 13 '14 at 14:21
  • I am discouraged right now. No, there are no tags, just the articles given as strings in input. – NiVeR Feb 13 '14 at 14:23
  • Then maybe you could replace tags with the k most common nouns/verbs appearing in the articles – BlackBear Feb 13 '14 at 14:24
  • Also keywords should be considered I think. Like words mentioned in the title. – NiVeR Feb 13 '14 at 14:26
  • Have a look at https://dandelion.eu/products/datatxt/cl/demo/ ...it's text classification into 12 categories. You could use something like that for your matching, however you define it. Maybe they have more info on their site or in papers on how they do it. – Jakub Kotowski Feb 13 '14 at 14:27
  • Unclear what you are asking. If it's just for similarity of the topic, you can just compare word frequencies. But determining what's the actual semantic content or the _opinion_ of the author is a whole different matter. – tobias_k Feb 13 '14 at 14:30
  • @tobias_k You compare word frequencies, and what results do you conclude with that? "I want to fly" and "I want to cry" have 3 out of 4 equal words, but they are totally different in meaning. – NiVeR Feb 13 '14 at 14:40
  • 'Word frequencies' was just a (still much simplified) example for the simpler of the two cases. See amit's answer for how to do it right. – tobias_k Feb 13 '14 at 15:11

1 Answer

There are many ways to measure the 'similarity' of articles; which one fits depends on what you know about the articles and on the test case you use to show how good your results are.

One simple solution is to use Jaccard similarity on the vocabulary used by the documents. Pseudocode:

similarity(doc1,doc2):
   set1 <- getWords(doc1)
   set2 <- getWords(doc2)
   intersection <- set_intersection(set1,set2)
   union <- set_union(set1,set2)
   return size(intersection)/size(union)

Note that instead of getWords you can also use bigrams, trigrams, ..., n-grams.
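
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the regex tokenizer, the names `get_ngrams` and `jaccard_similarity`, and the choice to return 0.0 for two empty documents are all assumptions made for the example.

```python
import re

def get_ngrams(text, n=1):
    """Return the set of word n-grams (as tuples) in `text`, lowercased.
    The regex tokenizer is a simplification chosen for this sketch."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(doc1, doc2, n=1):
    """Jaccard similarity |A & B| / |A | B| of the documents' n-gram sets."""
    set1, set2 = get_ngrams(doc1, n), get_ngrams(doc2, n)
    union = set1 | set2
    if not union:
        # Both documents are empty; 0.0 is a convention assumed here.
        return 0.0
    return len(set1 & set2) / len(union)

print(jaccard_similarity("I want to fly", "I want to cry"))       # 0.6 (unigrams)
print(jaccard_similarity("I want to fly", "I want to cry", n=2))  # 0.5 (bigrams)
```

The result would then be compared against a threshold of your choosing to decide whether two documents "match".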


A more complex unsupervised solution is to build a language model from each document and calculate the Jensen-Shannon divergence between the two models to judge whether the documents are similar.
A simple language model is P(word|document) = #occurrences(word,document)/size(document)
Usually some smoothing technique is applied to make sure no word has probability 0.
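
As a rough sketch of that idea, the following Python builds a unigram language model per document and compares the two with the Jensen-Shannon divergence. Several things here are assumptions for illustration: the regex tokenizer, add-alpha (Laplace) smoothing over the combined vocabulary as the smoothing technique, the base-2 logarithm (which keeps the divergence between 0 and 1), and all function names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer, assumed for this sketch.
    return re.findall(r"[a-z0-9']+", text.lower())

def language_model(tokens, vocabulary, alpha=1.0):
    """Unigram model P(word|document) with add-alpha smoothing (alpha > 0),
    so that no word in the shared vocabulary has probability 0."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def jensen_shannon_divergence(doc1, doc2, alpha=1.0):
    """0.0 means identical models; larger values mean less similar documents."""
    t1, t2 = tokenize(doc1), tokenize(doc2)
    vocabulary = set(t1) | set(t2)
    if not vocabulary:
        return 0.0  # two empty documents: treated as identical by convention
    p = language_model(t1, vocabulary, alpha)
    q = language_model(t2, vocabulary, alpha)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocabulary}  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

a = "the cat sat on the mat"
b = "the cat lay on the mat"
c = "stock prices fell sharply today"
print(jensen_shannon_divergence(a, b), jensen_shannon_divergence(a, c))
```

Note that this is a divergence, so a low value indicates similar documents; a "match" would be a value below some chosen threshold.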


Other solutions use supervised learning algorithms such as SVM. Your features can be the words (tf-idf model / bag of words model /...), and you use these features to classify whether doc1,doc2 are 'similar'. This requires obtaining a 'training set', which is basically a set of samples (doc1,doc2) with labels that tell you whether (doc1,doc2) are 'similar' or not. Feed the training data to a learner and build a model, which is later used to classify new pairs of documents.
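
A dependency-free sketch of the supervised approach might look like the following. It is only an illustration under several assumptions: each pair is reduced to two bag-of-words features (Jaccard overlap and cosine similarity of term counts) rather than raw word features, the linear SVM is trained with plain subgradient descent on the regularised hinge loss instead of a library solver, the hyperparameters are arbitrary, and the four training pairs are made up. In practice you would need many labelled pairs and would likely use an established SVM library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def pair_features(doc1, doc2):
    """Two bag-of-words features for a pair: Jaccard overlap, cosine of term counts."""
    c1, c2 = Counter(tokenize(doc1)), Counter(tokenize(doc2))
    union = set(c1) | set(c2)
    jaccard = len(set(c1) & set(c2)) / len(union) if union else 0.0
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return [jaccard, dot / norm if norm else 0.0]

def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=200):
    """Linear SVM via subgradient descent on the L2-regularised hinge loss.
    Labels must be +1 (similar) or -1 (not similar)."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(len(w)):
                # Regularisation always applies; the hinge term only if margin < 1.
                grad = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
            if margin < 1:
                b += lr * y
    return w, b

def predict(model, doc1, doc2):
    """Return True if the trained model classifies the pair as 'similar'."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, pair_features(doc1, doc2))) + b > 0

# Hypothetical, hand-made training set: (doc1, doc2, label).
training_pairs = [
    ("the cat sat on the mat", "the cat lay on the mat", 1),
    ("stock prices fell sharply today", "stock prices rose sharply today", 1),
    ("the cat sat on the mat", "stock prices fell sharply today", -1),
    ("rain is expected tomorrow", "the team won the final match", -1),
]
features = [pair_features(d1, d2) for d1, d2, _ in training_pairs]
labels = [label for _, _, label in training_pairs]
model = train_linear_svm(features, labels)

print(predict(model, "the cat sat on the mat", "the cat sat on the mat"))
print(predict(model, "rain is expected tomorrow", "stock prices fell sharply today"))
```

Reducing each pair to a few similarity scores keeps the example small; feeding the words themselves (for instance tf-idf vectors of both documents) as features is the richer variant the answer describes.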

amit