2

Let's say we have a list of articles that are indexed by Sunspot/Solr/Lucene (or any other search engine).

How can the index be used to find articles similar to a given article?

Should this be done with a summarization tool, such as http://www.wordsfinder.com/api_Keyword_Extractor.php, termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html?

javanna
Vlad Zloteanu
  • See this [answer](http://stackoverflow.com/questions/5122788/reducing-similar-top-results-in-solr-result-output/5123165#5123165) – Karussell Mar 01 '11 at 12:15

2 Answers

5

It seems you're looking for the MoreLikeThis feature.
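As a rough illustration, MoreLikeThis can be switched on for an ordinary Solr query through the `mlt` request parameters. The sketch below only builds the request URL; the host, the core name `articles`, the `id` field, and the `title`/`body` fields are assumptions, and `build_mlt_url` is a hypothetical helper, not part of Solr or Sunspot.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, count=5):
    """Build a Solr select URL that asks the MoreLikeThis component
    for documents similar to the one matching `id:<doc_id>`."""
    params = {
        "q": f"id:{doc_id}",          # the article to find neighbours for
        "mlt": "true",                # enable the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields used to measure similarity
        "mlt.mintf": 1,               # minimum term frequency in the source doc
        "mlt.mindf": 1,               # minimum document frequency in the index
        "mlt.count": count,           # number of similar docs to return
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance with a core named "articles".
url = build_mlt_url("http://localhost:8983/solr/articles", 42, ["title", "body"])
print(url)
```

The fields listed in `mlt.fl` generally need to be stored or have term vectors enabled for MoreLikeThis to extract terms from them, so check the schema before relying on this.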

Mauricio Scheffer
1

What you are trying to do is very similar to the task I outlined in this answer.

In brief, you need to generate a summary for each document that you can use as a query to compare it against every other document. A summary can be as simple as the top N terms in the document (excluding stop words). You can generate the top N terms from a Lucene document fairly easily without any third-party tools; there are plenty of examples on SO and the web.
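The idea can be sketched outside Lucene in a few lines of Python. This is a minimal illustration, not the approach from the linked answer: it ranks terms by raw frequency (a real index would more likely weight by TF-IDF), the stop-word list is a tiny placeholder, and `top_terms`, `summary_query`, and the `body` field name are all hypothetical.

```python
import re
from collections import Counter

# Tiny placeholder stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "that", "for"}


def top_terms(text, n=5):
    """Return the n most frequent non-stop-word terms in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]


def summary_query(text, field="body", n=5):
    """Turn a document's top terms into an OR query against one field."""
    return " OR ".join(f"{field}:{term}" for term in top_terms(text, n))


article = "Solr indexes the articles. Solr searches articles fast, and Solr scales."
print(summary_query(article, n=2))
```

The resulting query string can then be run against the index, excluding the source document's own id from the results.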

Community
Joel